Snapshot and replication of a multi-stream application on multiple hosts at near-sync frequency

ABSTRACT

Storage access requests are received from one or more applications. Multiple servers update multiple virtual disks as directed by the storage access requests. The virtual disks store data that is write order dependent across the virtual disks. Logs are associated with the virtual disks. Information associated with each storage access request is stored in one of the logs. A cycle of log switching is performed. A write order consistent tracking coordinator coordinates the log switching with agents at the servers to maintain request ordering. Replication coordinators coordinate the application of the switched-out log files from primary storage to replica storage, creating a write-order consistent point on the replica side matching the primary side, and providing for failure resiliency regarding transfer of the logs. The replication logs may be received individually on the replica side from the servers on the primary side to enable highly scalable parallel/simultaneous transfers of the logs.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a reissue of U.S. application Ser. No. 14/495,685, filed Sep. 24, 2014, entitled “SNAPSHOT AND REPLICATION OF A MULTI-STREAM APPLICATION ON MULTIPLE HOSTS AT NEAR-SYNC FREQUENCY,” now U.S. Pat. No. 10,073,902, issued Sep. 11, 2018.

This application is related to the following U.S. patent application, which is incorporated by reference herein in its entirety:

U.S. patent application Ser. No. 13/564,449, titled “Request Ordering Support When Switching Virtual Disk Replication Logs,” filed Aug. 1, 2012.

BACKGROUND

As computers have become more commonplace, individuals and businesses have become increasingly reliant on reliable computer systems. Recovery mechanisms can be implemented to protect against various malfunctions, such as power failures, hardware and/or software errors, and so forth. The operating system and/or other control programs of a computer can provide various recovery mechanisms.

Storage replication may be used to protect against the loss of stored data. According to storage replication, multiple storage units may be used to redundantly store the same data. In this manner, redundant copies of data are maintained in case of failure of one of the storage units. Various types of storage replication exist. For example, synchronous replication may be used, which guarantees that any write of data is completed in both primary and backup (or “replica”) storage. Alternatively, asynchronous replication may be used, where a write of data is typically considered to be complete when it is acknowledged by primary storage. The data is also written to backup storage, but frequently with a small time lag. Thus, the backup storage is not guaranteed to be synchronized with the primary storage at all times.

High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that frequently use asynchronous storage replication. An HA cluster uses redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. HA clusters are often used for critical databases, file sharing on a network, business applications, and customer services such as electronic commerce websites. HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including using multiple network connections and data storage which is redundantly connected via storage area networks.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Methods, systems, and computer program products are provided for write order consistent tracking. Storage access requests, such as write requests, are received from one or more applications (e.g., a distributed application). Storage request processing modules at multiple servers update multiple virtual disks as directed by the storage access requests. The virtual disks are primary storage that store data that is write order dependent across the virtual disks. Logs are associated with the virtual disks. Replication management modules store information associated with each storage access request in one of the logs associated with the virtual disks. A cycle of log switching is performed for the logs. A write order consistent tracking coordinator coordinates the log switching with agents at the servers to maintain request ordering. A replication coordinator coordinates the application of the switched-out log files to replica storage, to synchronize the replica storage with the primary storage.

Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in detail below with reference to the accompanying drawings. It is noted that the invention is not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.

BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES

The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present application and, together with the description, further serve to explain the principles of the embodiments and to enable a person skilled in the pertinent art to make and use the embodiments.

FIG. 1 illustrates an example system implementing the request ordering support when switching virtual disk replication logs in accordance with one or more embodiments.

FIG. 2 illustrates another example system implementing the request ordering support when switching virtual disk replication logs in accordance with one or more embodiments.

FIG. 3 illustrates an example architecture for implementing the request ordering support when switching virtual disk replication logs in accordance with one or more embodiments.

FIG. 4 is a flowchart illustrating an example process for implementing request ordering support when switching virtual disk replication logs in accordance with one or more embodiments.

FIG. 5 is a state diagram illustrating example states for implementing request ordering support when switching virtual disk replication logs in accordance with one or more embodiments.

FIG. 6 is a flowchart illustrating an example process for implementing request ordering support when switching virtual disk replication logs in accordance with one or more embodiments.

FIG. 7 shows a block diagram of a system that includes multiple virtual disks that store write order dependent data, and that implements the switching of virtual disk replication logs in a manner that maintains write order dependency across virtual disks, according to example embodiments.

FIG. 8 shows a flowchart providing a process for the switching of virtual disk replication logs in a manner that maintains write order dependency across virtual disks, according to an example embodiment.

FIG. 9 shows a block diagram of a write order consistent tracking coordinator, according to an example embodiment.

FIG. 10 shows a flowchart providing a process for initiating log switching, according to an example embodiment.

FIGS. 11 and 12 show block diagrams of a system of using lock files to coordinate log switching, according to example embodiments.

FIG. 13 shows a flowchart providing a process for coordinating a stage of log switching, according to an example embodiment.

FIG. 14 shows a process for using control codes to coordinate log switching, according to an example embodiment.

FIG. 15 shows a flowchart providing a process for using control codes to coordinate a stage of log switching, according to an example embodiment.

FIG. 16 shows a block diagram of a system that includes replication coordinators to coordinate log switching and the application of virtual disk replication logs to replica storage, according to example embodiments.

FIG. 17 shows a flowchart providing a process for coordinating log switching and the application of virtual disk replication logs to replica storage, according to an example embodiment.

FIG. 18 shows a block diagram of an example computing device that may be used to implement embodiments.

The features and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.

DETAILED DESCRIPTION

I. Introduction

The present specification and accompanying drawings disclose one or more embodiments that incorporate the features of the present invention. The scope of the present invention is not limited to the disclosed embodiments. The disclosed embodiments merely exemplify the present invention, and modified versions of the disclosed embodiments are also encompassed by the present invention. Embodiments of the present invention are defined by the claims appended hereto.

References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Numerous exemplary embodiments are described as follows. It is noted that any section/subsection headings provided herein are not intended to be limiting. Embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, embodiments disclosed in any section/subsection may be combined with any other embodiments described in the same section/subsection and/or a different section/subsection in any manner.

Request ordering support when switching virtual disk replication logs is discussed herein. Storage access requests, such as write requests, are received from a virtual machine. A storage request processing module updates one of multiple virtual disks as directed by each of the storage access requests. Additionally, a replication management module stores information associated with each storage access request in one of multiple logs. The logs can be transferred to a recovery device at various intervals and/or in response to various events, which results in switching logs so that the replication management module stores the information associated with each storage access request in a new log and the previous (old) log is transferred to the recovery device. During this switching, request ordering for write order dependent requests is maintained at least in part by blocking processing of the information associated with each storage access request.

Various embodiments are discussed herein in terms of virtual machines. Virtualization generally refers to an abstraction from physical resources. Hardware emulation involves the use of software that represents hardware that the operating system would typically interact with. Hardware emulation software can support guest operating systems, and virtualization software such as a hypervisor can establish a virtual machine (VM) on which a guest operating system operates. Much of the description herein is described in the context of virtual machines, but the techniques discussed herein are equally applicable to physical machines that do not employ virtualization.

To enable recovery of a device in the event of a malfunction, the information associated with that device is provided to a recovery device. In the context of virtual machines, a base replication can be provided, and updates or changes to that base replication can be provided as the virtual machine is running on its primary device.

The techniques discussed herein support systems in which differencing disks or other similar mechanisms are not needed to provide virtual storage replication and virtual machine recovery. In one or more embodiments, one or more logs (e.g., log files), also referred to as replication logs, are created that capture changes being made to a storage device, including a virtual disk. In one virtual machine embodiment, the logs can be created by preserving duplicates of change requests that are queued for inclusion into the virtual disk. The log processing and updating can be performed in parallel with the processing that updates the virtual disk, such that replicated data is created without additional latencies, and the logs can be prepared in such a way that they can be easily transferred to a recovery device(s) while limiting the impact on the Input/Output Operations Per Second (IOPS) to the running workload. Thus, while the techniques discussed herein may be used in addition to technologies such as differencing disks when used for other purposes, replication may be effected without the existence of any differencing disks in accordance with the disclosure.

In one or more embodiments, a virtual machine's write requests that are destined for a virtual disk are copied to a log data structure, such as a log queue. The log entries are taken from the queue and processed into a log. Writes to the log can be accumulated in memory, versus storage such as a virtual disk, disk or other physical storage. The write request information may be accumulated in memory before writing to the physical disk in order to, for example, reduce the impact on workload performance and response times inside the virtual machine. The writes to the log may be coordinated with the writes to the virtual disk file (e.g., virtual hard disk or “VHD” file) to, among other things, facilitate application-consistent snapshots of virtual machines. Further, the log format can be agnostic to virtual hard disk file format and type, such that it can be used to capture changes to a virtual disk of any type and format.

The following section describes embodiments for switching a replication log associated with storage. A current log (e.g., a log file that has been used to store indications of storage requests that are applied in parallel to primary storage) is switched out for a new log. The current log may then be applied to replica storage to synchronize the replica storage with primary storage while maintaining write order dependency. A subsequent section describes embodiments for switching multiple replication logs that are associated with multiple primary storage instances, where a write order dependency is present across the primary storage instances (e.g., the storage instances are written to by a distributed application, etc.). This is followed by a still further section that describes embodiments for applying replication logs to replica storage in a manner that maintains write order dependency across multiple instances of replica storage.

II. Example Embodiments for Request Ordering Support when Switching Virtual Disk Replication Logs

FIG. 1 illustrates an example system 100 implementing the request ordering support when switching virtual disk replication logs in accordance with one or more embodiments. Storage access requests 102 may be provided by any source, such as a virtual machine (VM) 104. Although illustrated as being provided by virtual machine 104, storage requests 102 can additionally or alternatively be provided by other components or modules, such as processors or other sources. The storage access requests 102 may be any type of storage access requests, such as write requests, requests to expand or contract the disk, or any other storage operations that can result in changes to the disk. In one or more embodiments, the storage access requests 102 represent write requests to store data.

In the illustrated embodiment, the data is stored in one or more virtual disks 106, each of which can represent one or more files stored on physical storage media. A storage request processing module 108 directs and processes incoming requests 102 to the virtual disks 106. For example, the requests 102 may represent write requests that are temporarily buffered at storage request processing module 108 until they can be used to update a virtual disk 106. Each virtual disk 106 may include a single virtual storage file (e.g., VHD file) or multiple files (e.g., a VHD file and one or more differencing disk files (also referred to as AVHD files)). Thus, for example, changes to a virtual disk 106 may be made to a single file representing the virtual disk 106, and logs as discussed herein may be used in lieu of differencing disks or similar states of the virtual disk 106 for replication purposes.

Replication management module 110 receives the same storage access requests 102 that are being received at storage request processing module 108. Storage access requests 102 may be received in different manners, such as from the virtual machine 104, from an intermediate module (not shown), from storage request processing module 108 itself, and so forth. In one or more embodiments, replication management module 110 is implemented integrally with storage request processing module 108. In such situations, replication management module 110 may receive a copy of the storage access requests 102 upon receipt of the requests 102 at storage request processing module 108, or storage request processing module 108 may create and provide a copy of storage access requests 102 to replication management module 110. It should be noted that modules such as storage request processing module 108 and replication management module 110 can be implemented in different manners. For example, module 108 and/or module 110 may be provided within the virtual machine 104, may be provided by a hypervisor, may be provided by a parent partition operating system or other operating system, and so forth.

Replication management module 110 can buffer the storage access requests 102 in parallel with the buffering and/or processing of the storage access requests 102 by the storage request processing module 108. The buffered storage access requests 102 are written to one or more logs 112, such as a log file, for replication purposes and typically without significantly impacting storage IOPS. Typically, each virtual disk 106 has a corresponding log 112. As write requests or other storage access requests 102 are being processed to update the virtual disks 106 in response to virtual machine 104 processing, replication management module 110 tracks changes to the virtual disks 106 in one or more logs 112.

At various intervals and/or in response to various events, logs 112 can be transmitted, such as via transmitter 114, elsewhere (e.g., to a recovery device) where a recovery system or virtual machine may be instantiated to replicate the virtual machine 104. Transmitter 114, which may be a stand-alone transmitter or associated with another device (e.g., a transceiver, a network interface module, etc.), can provide the log 112 to a destination such as a recovery system or server as a recovery replica of at least a portion of a virtual disk 106. When one log is transmitted elsewhere, the log being transferred is referred to as the old log, and a new log is created. The buffered storage access requests are then written to the new log rather than the old log. This process of changing to storing the storage access requests in the new log rather than the old log is also referred to as log switching.
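
By way of illustration only, the basic log switching described above can be sketched in Python as follows. The names `ReplicationLog`, `LogSwitcher`, `record`, and `switch` are hypothetical and do not appear in the embodiments. The sketch shows only the hand-off from the old log to a new log and omits the request ordering support that is discussed in more detail below.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicationLog:
    """One replication log; entries are kept in arrival order."""
    log_id: int
    entries: List[bytes] = field(default_factory=list)


class LogSwitcher:
    """Tracks the log that currently receives buffered storage access requests."""

    def __init__(self) -> None:
        self._next_id = 0
        self.current = self._new_log()

    def _new_log(self) -> ReplicationLog:
        log = ReplicationLog(log_id=self._next_id)
        self._next_id += 1
        return log

    def record(self, request: bytes) -> None:
        # Requests are always written to whichever log is current.
        self.current.entries.append(request)

    def switch(self) -> ReplicationLog:
        """Create a new log and return the old log for transmission."""
        old_log = self.current
        self.current = self._new_log()
        return old_log


switcher = LogSwitcher()
switcher.record(b"write-A")
switcher.record(b"write-B")
old = switcher.switch()        # the old log is now ready to be transmitted
switcher.record(b"write-C")    # later requests land in the new log
```

In this sketch the old log is simply returned to the caller, which stands in for handing the log to a transmitter.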

FIG. 2 illustrates another example system 200 implementing the request ordering support when switching virtual disk replication logs in accordance with one or more embodiments. System 200 is similar to system 100 of FIG. 1, including storage (e.g., write) requests 102, one or more virtual disks 106, a storage request processing module 108, and a replication management module 110. In system 200, a virtual machine or other source issues write requests 102 that will ultimately change one or more virtual disks 106 with the data being written thereto. Both storage request processing module 108 and replication management module 110 receive the write requests 102. As storage request processing module 108 processes the write requests 102 for inclusion on a virtual disk 106, replication management module 110 queues the write requests 102 for writing to one or more logs 202.

In one or more embodiments, logs 202 are captured in memory 204 (e.g., random access memory) to reduce input/output (I/O) processing and improve IOPS relative to solutions involving writing to disk such as differencing disks. Each log 202 may be written to storage 206 (e.g., a magnetic or optical disk, a Flash memory drive, etc.) at desired regular or irregular intervals such as, for example, fixed intervals, random intervals, intervals based on triggered events (e.g., the size of all logs 202 in memory 204, the size of a particular log 202 in memory 204, etc.), and so forth. Replication management module 110 includes a storage write control module 208 that determines when a log 202 in memory 204 is to be written to storage 206 as illustrated by one or more logs 210. In one or more embodiments, storage write control module 208 writes a log 202 to storage 206 as a log 210 when memory 204 that has been allocated for the log 202 reaches a threshold. Each log 210 is typically a single file in storage 206, but can alternatively be multiple files and/or portions of a file (e.g., multiple logs may be stored in a single log file). For example, a write of a log 202 from memory 204 to log 210 in storage 206 may occur when the allocated memory for log 202 reaches 90% capacity. In one or more embodiments, storage write control module 208 also writes a log 202 to storage 206 as a log 210 when the log for the corresponding virtual disk 106 is to be switched to a new log, as discussed in more detail below. By accumulating write requests 102 in memory 204 and infrequently writing the logs to physical storage 206, the impact on virtual machine workload performance and response times inside the virtual machine can be reduced.
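
As a non-limiting sketch, the threshold-based write of an in-memory log to storage may be pictured as follows in Python. The class `InMemoryLog`, its fields, and the use of a Python list as a stand-in for the log file in storage are assumptions made for illustration. The 90% figure is the example value given above.

```python
from typing import List


class InMemoryLog:
    """Accumulates log entries in memory and flushes them at a threshold."""

    def __init__(self, capacity_bytes: int, flush_percent: int = 90) -> None:
        self.capacity_bytes = capacity_bytes   # memory allocated for this log
        self.flush_percent = flush_percent     # example threshold from the text
        self.buffer: List[bytes] = []          # entries held in memory
        self.used_bytes = 0
        self.storage: List[bytes] = []         # stand-in for the log file in storage
        self.flush_count = 0

    def append(self, entry: bytes) -> None:
        self.buffer.append(entry)
        self.used_bytes += len(entry)
        # Integer comparison avoids floating point rounding at the boundary.
        if self.used_bytes * 100 >= self.flush_percent * self.capacity_bytes:
            self.flush()

    def flush(self) -> None:
        """Write buffered entries to storage; also usable when a log is switched."""
        if not self.buffer:
            return
        self.storage.extend(self.buffer)
        self.buffer.clear()
        self.used_bytes = 0
        self.flush_count += 1


log = InMemoryLog(capacity_bytes=100)
for _ in range(8):
    log.append(b"x" * 10)          # 80 bytes used, below the threshold
flushes_before = log.flush_count
log.append(b"x" * 10)              # 90 bytes used, threshold reached, flush occurs
```

A caller that switches logs could invoke `flush()` directly, which corresponds to writing the log to storage when the log is to be switched.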

At various intervals and/or in response to various events, logs 202 and/or 210 can be transmitted, such as via transmitter 114, elsewhere as discussed above with reference to FIG. 1. When one log is transmitted elsewhere, the buffered storage access requests are then written to the new log rather than the old log.

In systems 100 of FIG. 1 and 200 of FIG. 2, virtual machines or other sources may issue storage access requests having particular ordering requirements. For example, database, mail server, or other applications in the virtual machine may implement their own recovery mechanisms and desire to have particular storage access requests (e.g., particular writes) occur in a particular order as part of those recovery mechanisms. Replication management modules 110 account for these ordering requirements when switching logs, as discussed in more detail below.

FIG. 3 illustrates an example architecture 300 for implementing the request ordering support when switching virtual disk replication logs in accordance with one or more embodiments. Architecture 300 can implement, for example, a system 100 of FIG. 1 or a system 200 of FIG. 2. Architecture 300 is discussed with reference to the storage access requests being I/O write requests, although various other types of storage access requests can also be processed by architecture 300. In the example architecture 300 the write requests are implemented as small computer system interface (SCSI) request blocks (SRBs) 302. SRB 302 is a representative manner in which an I/O request can be submitted to a storage device. SRB 302 may include information such as the command to send to the device, the buffer location and size, and so forth. In one or more embodiments, each change request to a virtual disk is in the form of an SRB 302. While SRBs are discussed as an example, it should be noted that various other I/O request types can be used with the techniques discussed herein.

In the illustrated example, SRB 302 is provided by an interface to upper layers, shown as virtual hard disk (VHD) interface 304 (e.g., which may be implemented in a VHD parser system or .sys file). In this example, VHD interface 304 represents an internal interface to the upper layers, which performs internal translation and sends SRB 302 to a replication management module, which in FIG. 3 is part of virtual disk parser 306. Storage requests may also be provided via the VHD interface 308, which is also an interface to upper layers, where the storage requests may be provided via an input/output control (IOCTL) call 310 that is handled by an IOCTL handler 312 of virtual disk parser 306. IOCTL handler 312 provides an interface through which an application on the virtual machine can communicate directly with a device driver using control codes. Thus, storage access requests may be received via one or more different input types.

In one or more embodiments, virtual disk parser 306 can be an adaptation of a VHD mini-port, such as VHDMP.sys available in the Hyper-V® virtualization system available from Microsoft Corporation of Redmond, Wash. Assuming in this example that the virtual disk is represented by a VHD file 314, the storage stack for such VHD files 314 can include a mini-port driver such as VHDMP.sys, which represents VHD parser 306. VHD parser 306 enables I/O requests to the VHD file 314 in storage 316 (e.g., a magnetic or optical disk, a Flash memory drive, etc.) to be sent to the host file system. The host file system is illustrated as a new technology file system (NTFS) 318, although various other host file systems can alternatively be used.

For purposes of example, it is assumed in the description of example architecture 300 that SRBs 302 include write requests to change a virtual disk such as VHD file 314. SRBs 302, which originate inside the virtual machine, reach virtual disk parser 306 at SRB request handler 320. In one or more embodiments, SRB request handler 320 creates an instance of a custom data structure for each SRB 302, and embeds the SRB 302 inside this instance which is added to VHD request queue 322. VHD request queue 322 maintains the write requests to VHD file 314 that are pending for processing. SRB request handler 320 adds these SRBs 302 to queue 322, and as described below VHD request processing module 324 removes the write requests from VHD request queue 322 to process the write requests. Multiple representative VHD request queue 322 entries are depicted as V1 330, V2 332, V3 334 and V4 336. VHD request queue 322 and VHD request processing module 324 together can be a storage request processing module 108 of FIG. 1 or FIG. 2.

In one or more embodiments, IOCTL handler 312 may also receive requests from management modules, such as virtual machine management service (VMMS) 340 (e.g., an executable or .exe file) provided as part of the Hyper-V® virtualization system. VMMS 340 generally represents a management service that serves as a point of interaction for incoming management requests. VMMS 340 can provide requests to IOCTL handler 312 for enabling and disabling change tracking for a virtual disk. For example, VMMS 340 may issue a request via an IOCTL call 310 to IOCTL handler 312, which causes log request queue 342 and log request processing module 344 to be initialized. VMMS 340 can also provide requests to IOCTL handler 312 for managing the switching of logs while the virtual machine is running. For example, VMMS 340 may issue requests to advance virtual disk parser 306 through multiple stages of switching logs, as discussed in more detail below.

When change tracking is enabled, another instance of the custom data structure for the SRB 302 that is added to VHD request queue 322 is created and added as an entry to log request queue 342. In one or more embodiments, a data buffer of write requests (e.g., SRBs 302) may be shared by the custom data structure instances for the SRBs 302 in both VHD request queue 322 and log request queue 342. Log request queue 342 maintains the log write requests that are pending for processing. Representative log request queue 342 entries are depicted as L1 350, L2 352, L3 354 and L4 356. Entries of log request queue 342 and VHD request queue 322 correspond to one another: an entry of log request queue 342 that includes the same SRB 302 (or references the same shared SRB 302) as an entry of VHD request queue 322 is referred to as corresponding to or being associated with that entry of VHD request queue 322. Log request queue 342 and log request processing module 344 together can be a replication management module 110 of FIG. 1 or FIG. 2.
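
The relationship between the two queues can be sketched, under stated assumptions, as follows. `RequestEntry` is a hypothetical stand-in for the custom data structure, a `bytearray` stands in for the data buffer of an SRB, and the `tracking_enabled` parameter is an illustrative simplification of change tracking being enabled or disabled for a virtual disk.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RequestEntry:
    """Hypothetical stand-in for the custom data structure wrapping one SRB."""
    request_id: int
    data: bytearray   # data buffer; shared between queues rather than copied


vhd_request_queue: deque = deque()   # FIFO of pending virtual disk writes
log_request_queue: deque = deque()   # FIFO of pending log writes


def enqueue_write(request_id: int, data: bytearray,
                  tracking_enabled: bool = True) -> None:
    """Queue a write for the virtual disk and, if tracking is on, for the log."""
    vhd_request_queue.append(RequestEntry(request_id, data))
    if tracking_enabled:
        # A second instance is created for the log queue; it references
        # the same data buffer as the corresponding VHD queue entry.
        log_request_queue.append(RequestEntry(request_id, data))


payload = bytearray(b"block-0")
enqueue_write(1, payload)
enqueue_write(2, bytearray(b"block-1"), tracking_enabled=False)

v1 = vhd_request_queue[0]
l1 = log_request_queue[0]
```

Here `v1` and `l1` are distinct instances that correspond to one another because they carry the same request identifier and reference the same buffer.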

VHD request processing module 324 removes queued write requests from queue entries 330-336 of VHD request queue 322 to process the write requests. VHD request processing module 324 processes write requests by writing the requested data to VHD file 314. Based on the virtual hard disk format and type, in one or more embodiments VHD request processing module 324 sends one or more I/O request packets (IRPs) to VHD file 314 via NTFS 318 to complete each write request.

Log request processing module 344 removes queued write requests from log queue entries 350-356 of log request queue 342 to process the write requests. Log request processing module 344 processes the write requests or log queue entries by storing in log 364 the log queue entries 350-356 that include the write requests. Log 364 can be one or more log files, and the log queue entries 350-356 can be stored to the one or more log files via NTFS 318. Thus, log request queue 342 is copied to log 364 that, in the illustrated embodiment, is stored in storage 368 (e.g., a magnetic or optical disk, a Flash memory drive, etc.). Storage 368 may be the same or different storage as storage 316 in which the VHD files are stored. It should be noted that in one or more embodiments, while the log 364 may be stored in some storage 368, the log is cached or otherwise buffered in memory (e.g., random access memory) until a time when the log is to be sent to storage 368. Log request processing module 344 processing the write requests or log queue entries includes storing the log queue entries 350-356 that include the write requests in such a cache or buffer.

New log entries for write requests are created for each new storage request and placed in log request queue 342, typically substantially in parallel with the creating and placing of a new VHD request queue entry for the write request in VHD request queue 322. Similarly, the next write request in log request queue 342 is removed and copied to log 364, typically substantially in parallel with the corresponding entry for the write request being removed from VHD request queue 322 and processed by VHD request processing module 324. VHD request queue 322 and log request queue 342 are typically first-in-first-out (FIFO) queues, although other queuing techniques can alternatively be used.

A particular queued write request (e.g., a request in one of queue entries 330-336) is considered to be complete in response to two conditions being satisfied: 1) all of the issued IRPs to VHD file 314 for the write request are completed, and 2) the log request queue entry corresponding to the VHD request queue entry that includes the write request is written to log 364. The log request queue entry being written to log 364 refers to the log request queue entry being added to the log regardless of whether the log is cached or otherwise buffered in memory (e.g., the log request queue entry can be written to log 364 even though the log, and thus the log request queue entry, is being maintained in a buffer or other memory rather than storage 368). In response to a particular write request being complete, VHD parser 306 returns a completion response for the particular write request to the virtual machine from which the particular write request was received. The completion response can be returned to the virtual machine by any of various components or modules of virtual disk parser 306.
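
The two-condition completion rule above can be expressed as a small state check. The following Python sketch is illustrative only; `WriteRequestState` and its method names are hypothetical, and a simple counter stands in for tracking of the IRPs issued to the VHD file.

```python
from dataclasses import dataclass


@dataclass
class WriteRequestState:
    """Tracks the two conditions that together make a write request complete."""
    issued_irps: int                  # IRPs issued to the VHD file for this write
    completed_irps: int = 0
    log_entry_written: bool = False   # True once added to the log, even if buffered

    def irp_completed(self) -> None:
        self.completed_irps += 1

    def log_written(self) -> None:
        self.log_entry_written = True

    def is_complete(self) -> bool:
        # Condition 1: every issued IRP has completed.
        # Condition 2: the corresponding log request queue entry is in the log.
        return (self.completed_irps == self.issued_irps
                and self.log_entry_written)


state = WriteRequestState(issued_irps=2)
state.irp_completed()
state.irp_completed()
after_irps_only = state.is_complete()   # log entry is still pending
state.log_written()
after_both = state.is_complete()        # completion response may now be returned
```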

In one or more embodiments, the log can be stored (at least temporarily)in memory as discussed above. The log stored in memory can be directlytransmitted to one or more recovery devices from memory. Alternatively,the log can be written to a physical storage medium (e.g., magnetic oroptical disk, Flash memory disk, etc.) and subsequently transmittedelsewhere (e.g., to one or more recovery devices) from the physicalstorage medium. Regardless of whether the log is transmitted from memoryor a physical storage medium, various conditions can dictate when thelog will be transmitted elsewhere. The condition may be, for example, atime, a time duration, a triggering event, and so forth. For example,the condition may be a particular time interval (e.g., five minutes), aparticular event (e.g., a log file reaching a threshold size and/orhaving a threshold number of entries), and so forth. The recoverydevices can be any of a variety of different recovery servers and/orrecovery storage devices.
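
As a minimal sketch of such conditions, the check below combines a time interval with size and entry-count thresholds. The function name and the threshold values other than the five-minute interval are assumptions chosen for illustration.

```python
def should_transmit(elapsed_s: float, log_bytes: int, entry_count: int,
                    interval_s: float = 300.0,          # five minutes
                    max_bytes: int = 64 * 1024 * 1024,  # assumed threshold
                    max_entries: int = 10_000) -> bool:  # assumed threshold
    """Return True when any one of the example conditions is met."""
    return (elapsed_s >= interval_s
            or log_bytes >= max_bytes
            or entry_count >= max_entries)
```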

When the log, referred to as the old log, is transmitted elsewhere (e.g., to a recovery device), a new log is created. Log request processing module 344 then proceeds to store entries in log request queue 342 into the new log. This change, from storing entries of log request queue 342 in the old log to storing them in the new log, is also referred to as log switching.

The recovery device is a separate computing device from the deviceimplementing architecture 300 and/or a separate storage device fromstorage 316 (and storage 368). The recovery device receives thetransmitted log and maintains or otherwise uses the transmitted log forrecovery purposes. For example, if a malfunction were to occur in thedevice implementing architecture 300, then the logs received by therecovery device can be used to recreate VHD file 314. The recoverydevice can maintain or otherwise use the transmitted log in differentmanners. In one or more embodiments, the recovery device stores the log,allowing the requests in the log to be subsequently applied, if recoveryof VHD file 314 is desired, to a previously stored copy of VHD file 314(a copy of VHD file 314 that does not include the changes indicated inthe log, and that is stored on the recovery device or elsewhere) inorder to recover VHD file 314. Alternatively, the requests in the logcan be processed and applied to a previously stored copy of VHD file 314(a copy of VHD file 314 that does not include the changes indicated inthe log, and that is stored on the recovery device or elsewhere),allowing a duplicate copy of VHD file 314 to be maintained at therecovery device. The request in the log can be processed and applied toa previously stored copy of VHD file 314 in a manner analogous to thatperformed by VHD request processing module 324 in processing requests inVHD request queue 322 as discussed above.

Log 364 includes the storage requests from log request queue 342, aswell as sufficient additional data for VHD file 314 to be recoveredand/or replicated. Log 364 can include various data and/or metadataregarding the storage requests stored in log 364 from log request queue342 and VHD file 314. In one or more embodiments, log 364 includes aheader portion, one or more metadata portions, and one or more dataportions. The one or more data portions include the entries from the logrequest queue (or alternatively the data from the entries of the logrequest queue) that include the write requests or other storagerequests.

The header portion includes, for example, information to identify thelog, information to indicate the size of one or more metadata portions,information to indicate how many metadata portions are included in thelog, and information to indicate the location of the last valid data ofthe log (the end of the log or EOL). The header portion can includevarious other information, such as a version identifier of the log, atime stamp indicating when the log was created (and/or last modified), asize of the log, a checksum for the log, an error code (e.g., indicatingwhether an error occurred in creating or receiving the log), and soforth.

Each metadata portion includes, for example, a metadata header and oneor more metadata entries. The metadata provides, for example,information describing the changes to the virtual disk (the VHD file).For example, the metadata header can include an indication of the sizeof the metadata header, an indication of the location of the previousmetadata portion in the log, an indication of the location of the nextmetadata portion in the log, an indication of the number of metadataentries in the metadata portion, a checksum value for the metadataportion, and so forth. Each metadata entry provides, for example,information about the virtual disk address range that is modified. Forexample, each metadata entry can include a byte offset that indicates anactual physical address on the virtual disk that was modified, achecksum value for the metadata entry, a data length indicating a sizeof the data in a data portion, a timestamp value indicating a timeand/or date when the storage request resulting in the data in a dataportion was received by the VHD parser, the meta operation of the datain a data portion (e.g., a write operation, a no operation (NOOP),etc.), and so forth.
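
The log layout described above can be sketched with simple data structures. This is an illustrative sketch only: the field names, the use of CRC-32, and the way the checksum is computed over the entry fields are assumptions, not a definition of the log format.

```python
import zlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataEntry:
    byte_offset: int   # address on the virtual disk that was modified
    data_length: int   # size of the data in the data portion
    timestamp: float   # when the storage request was received
    operation: str     # e.g., "write" or "noop"
    checksum: int      # checksum for the entry (CRC-32 is an assumption)


@dataclass
class MetadataPortion:
    prev_offset: int   # location of the previous metadata portion
    next_offset: int   # location of the next metadata portion
    entries: List[MetadataEntry] = field(default_factory=list)


@dataclass
class LogHeader:
    log_id: str          # identifies the log
    metadata_size: int   # size of a metadata portion
    metadata_count: int  # number of metadata portions in the log
    eol_offset: int      # location of the last valid data (EOL)


def make_entry(byte_offset: int, data: bytes, timestamp: float,
               operation: str = "write") -> MetadataEntry:
    """Build a metadata entry describing one piece of logged data."""
    fields = f"{byte_offset}:{len(data)}:{timestamp}:{operation}".encode()
    return MetadataEntry(byte_offset, len(data), timestamp, operation,
                         zlib.crc32(fields))
```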

In the example architecture 300, although one VHD file 314 and one log364 are illustrated, in one or more embodiments architecture 300includes multiple VHD files 314 (stored in the same and/or differentstorage 316) as well as multiple logs 364 (stored in the same and/ordifferent storage 368). VHD parser 306 can include a separate VHDrequest queue for each VHD file with each VHD request queuecorresponding to a single VHD file, or alternatively a single VHDrequest queue can correspond to (and thus include entries for) multipledifferent VHD files. VHD parser 306 can also include a separate logrequest queue for each log with each log request queue corresponding toa single log, or alternatively a single log request queue can correspondto (and thus include entries for) multiple different logs.

In situations in which the system (e.g., system 100 of FIG. 1 and/orsystem 200 of FIG. 2) or architecture (e.g., architecture 300 of FIG. 3)includes multiple logs, the log switching includes switching of all ofthe multiple logs at approximately the same time. However, there istypically no guaranteed ordering in which the logs are switched,typically no dependency on one log being switched before another, andtypically no guaranteed speed at which the logs are switched.Accordingly, a virtual machine cannot rely on logs being switched in aparticular order.

FIG. 4 is a flowchart illustrating an example process 400 forimplementing request ordering support when switching virtual diskreplication logs in accordance with one or more embodiments. Process 400is carried out, for example, by a system 100 of FIG. 1, a system 200 ofFIG. 2, and/or an architecture 300 of FIG. 3, and can be implemented inhardware or a combination of hardware with one or both of software andfirmware. Process 400 is shown as a set of acts and is not limited tothe order shown for performing the operations of the various acts.Process 400 is an example process for implementing request orderingsupport when switching virtual disk replication logs; additionaldiscussions of implementing request ordering support when switchingvirtual disk replication logs are included herein with reference todifferent figures.

Generally, process 400 is performed in two parts. In a first part 402,the new logs are initialized and processing of new log queue entries isblocked. Blocking of new log queue entries refers to entries in the logqueue not being processed (e.g., by log request processing module 344 ofFIG. 3) and stored in the log file; however, new entries can be added tothe log request queue while processing of new log queue entries isblocked. In a second part 404, the new logs are changed to, processingof new log queue entries is unblocked, and the switching of logs isfinalized. After the processing of new log queue entries is unblocked,entries in the log queue can be processed (e.g., by log requestprocessing module 344 of FIG. 3) and stored in the new logs.

More specifically, first part 402 includes a first stage 412 in whichthe new logs are initialized. For each log being switched (e.g., eachcurrent log), a new log is initialized. Initializing a new log refers togenerating the appropriate data structures, creating the appropriateheaders, and so forth for the new log. During first stage 412, log queueentries continue to be processed (e.g., by log request processing module344 of FIG. 3), and VHD request queue entries continue to be processed(e.g., by VHD request processing module 324 of FIG. 3).

First part 402 also includes a stage 414 in which processing of new logqueue entries is blocked. Stage 414 occurs after all of the new logs areinitialized (although alternatively may occur after less than all of thenew logs are initialized). In stage 414, log queue entries can be addedto the log request queue, VHD queue entries can be added to the VHDrequest queue, and VHD queue entries can be processed (e.g., by VHDrequest processing module 324 of FIG. 3), but log queue entries are notprocessed (e.g., by log request processing module 344 of FIG. 3). Asdiscussed above, a storage request is not indicated as being completeduntil both the VHD queue entry is processed and the corresponding logqueue entry is processed. Thus, although VHD queue entries can beprocessed while processing of new log queue entries is blocked, therequests in such processed VHD queue entries are not indicated as beingcompleted because the corresponding log queue entry has not yet beenprocessed.

Second part 404 includes a stage 416 in which the change to the new logsoccurs and processing of new log queue entries is unblocked. For eachlog being switched, the new log (initialized in stage 412) is changed toin stage 416. Changing to the new log refers to any pointers or otherindications of the log to be used being changed to the new log ratherthan the old log (the log being switched from, and in which log queuerequests were stored prior to blocking processing of the new log queueentries in stage 414). For all logs being switched, after the new logshave been changed to, processing of new log queue entries is unblocked.After processing of new log queue entries is unblocked, the operation ofthe system or architecture resumes as discussed above—VHD queue entriescan be added to the VHD request queue and processed (e.g., by VHDrequest processing module 324 of FIG. 3), and log queue entries can beadded to the log request queue and processed (e.g., by log requestprocessing module 344 of FIG. 3).

Second part 404 also includes a stage 418 in which switching of the logsis finalized. Finalizing switching of the logs includes variousoperations to transfer the old logs elsewhere (e.g., to a recoverydevice). Finalizing switching of the logs can include, for example,flushing any queue entries of the old log in memory to storage, addingadditional information to a header of the old log, transmitting the oldlog elsewhere, and so forth. Stage 418 typically occurs after processingof the new log queue entries is unblocked, although stages 416 and 418can alternatively be performed at least in part at the same time (so atleast some of the finalization in stage 418 can be performed while thenew logs are being changed to and processing of the new log queueentries is being unblocked in stage 416).
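
The four stages of process 400 can be sketched as follows. LogWriter is a hypothetical stand-in for the per-log behavior of log request processing module 344, and the function names are chosen for illustration only; logs are modeled as in-memory lists.

```python
from collections import deque


class LogWriter:
    """Hypothetical per-log queue and log, modeled in memory."""

    def __init__(self):
        self.queue = deque()   # log request queue
        self.current_log = []  # log currently being written to
        self.new_log = None    # new log, once initialized
        self.blocked = False   # True while processing is blocked

    def enqueue(self, entry):
        self.queue.append(entry)  # adding entries is always allowed
        self.process()

    def process(self):
        # Entries are moved to the log only while processing is unblocked.
        while not self.blocked and self.queue:
            self.current_log.append(self.queue.popleft())


def stage_412_initialize(writers):
    for w in writers:
        w.new_log = []


def stage_414_block(writers):
    for w in writers:
        w.blocked = True


def stage_416_change_and_unblock(writers):
    old_logs = []
    for w in writers:  # change every log before unblocking any of them
        old_logs.append(w.current_log)
        w.current_log, w.new_log = w.new_log, None
    for w in writers:
        w.blocked = False
        w.process()
    return old_logs


def stage_418_finalize(old_logs, send):
    for log in old_logs:
        send(log)  # e.g., transmit the old log to a recovery device
```

In this sketch, an entry queued while processing is blocked lands in the new log after stage 416, never in an old log.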

FIG. 5 is a state diagram 500 illustrating example states forimplementing request ordering support when switching virtual diskreplication logs in accordance with one or more embodiments. Statediagram 500 illustrates the different states that a component or moduleof a VHD parser (e.g., VHD parser 306 of FIG. 3) or replicationmanagement module (e.g., module 110 of FIGS. 1 and 2) transitionsthrough. State diagram 500 is discussed with reference to a switchmanager implementing state diagram 500. The switch manager may be IOCTLhandler 312 of FIG. 3, another component or module of the VHD parser orreplication management module, and so forth. Commands or requests totransition to different states are received by the switch manager from amanagement service (or other module), such as VMMS 340 of FIG. 3,another component or module of a hypervisor, and so forth.

When change tracking is enabled (e.g., the use of logs and log request queues as discussed herein is enabled), the switch manager transitions to a new log ready for initialize state 502. The switch manager waits in state 502 until an initialize new log command is received from the management service. The initialize new log command is received after some interval elapses, an event occurs, etc. as discussed above.

In response to the initialize new log command, the switch manager transitions to a new log initialized state 504. In state 504, the switch manager initializes (or communicates with one or more other modules or components to initialize) the new logs. The first stage 412 of FIG. 4 is implemented by the switch manager while in state 504. After the new logs are initialized, the switch manager notifies (e.g., communicates a response to) the management service that the new logs are initialized.

In response to the notification that the new logs are initialized, the management service sends to the switch manager a block write requests response. In response to the block write requests response, the switch manager transitions to a new log writes blocked state 506. In state 506, the switch manager blocks processing of new log queue entries (e.g., by notifying log request processing module 344 to cease processing of log queue entries), and changes from the old logs to the new logs. This change can be, for example, providing indications of (e.g., identifiers of) the new logs to log request processing module 344. The second stage 414 of FIG. 4 as well as part of the third stage 416 (the changing to the new logs) is implemented by the switch manager while in state 506. After processing of new log queue entries is blocked and the change to the new logs is completed, the switch manager notifies (e.g., communicates a response to) the management service that processing of new log queue entries is blocked and the change to the new logs is completed.

In response to the notification that processing of new log queue entriesis blocked and the change to the new logs is completed, the managementservice sends to the switch manager an unblock write requests response.In response to the unblock write requests response, the switch managertransitions to a new log writes unblocked state 508. In state 508, theswitch manager unblocks processing of new log queue entries (e.g., bynotifying log request processing module 344 to resume processing of logqueue entries), and finalizes switching of the logs. Various operationscan be performed in finalizing switching of the logs, as discussedabove. The fourth stage 418 of FIG. 4 is implemented by the switchmanager while in state 508. After processing of new log queue entries isunblocked and the switching of the logs is finalized, the switch managernotifies (e.g., communicates a response to) the management service thatprocessing of new log queue entries is unblocked and the switching ofthe logs is finalized.

In response to notification that processing of new log queue entries is unblocked and the switching of the logs is finalized, the management service sends to the switch manager a finalize old logs request. In response to the finalize old logs request, the switch manager transitions to new log ready for initialize state 502.

While in state 502, 504, or 506, an unexpected request may be received by the switch manager. An unexpected request refers to a request other than a request that would allow the switch manager to transition to the next state to continue the log switching (e.g., any request other than an initialize new log request while in state 502, any request other than a block write requests response while in state 504, any request other than an unblock write requests response while in state 506). In response to an unexpected request, the switch manager transitions to new log cleanup state 510. In new log cleanup state 510, the switch manager performs various operations to undo any changes made as part of the log switching. These operations can include, for example, deleting new logs that were created, preventing old logs from being changed from, and so forth. After completing the various operations to undo any changes made as part of the log switching, the switch manager transitions to new log ready for initialize state 502.

Similarly, while in state 508 an unexpected request may be received by the switch manager. An unexpected request refers to a request other than a request that would allow the switch manager to transition to the next state to continue the log switching (e.g., any request other than a finalize old log request). In response to an unexpected request, the switch manager transitions to change tracking disabled state 512. In state 512, change tracking (e.g., the use of logs and log request queues as discussed herein) is disabled. If an unexpected request is received at state 508, the switch manager assumes that a significant problem has occurred and thus, rather than entering new log cleanup state 510, disables change tracking.
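
The transitions of state diagram 500 can be sketched as a small table-driven state machine. The request strings, the function names, and the choice to raise an error for requests received in states 510 and 512 are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    READY = 502             # new log ready for initialize
    INITIALIZED = 504       # new log initialized
    WRITES_BLOCKED = 506    # new log writes blocked
    WRITES_UNBLOCKED = 508  # new log writes unblocked
    CLEANUP = 510           # new log cleanup
    DISABLED = 512          # change tracking disabled


# For each state: the one request that continues the log switching,
# and the state it leads to.
EXPECTED = {
    State.READY: ("initialize_new_log", State.INITIALIZED),
    State.INITIALIZED: ("block_write_requests", State.WRITES_BLOCKED),
    State.WRITES_BLOCKED: ("unblock_write_requests", State.WRITES_UNBLOCKED),
    State.WRITES_UNBLOCKED: ("finalize_old_logs", State.READY),
}


def step(state: State, request: str) -> State:
    """Return the next state for a request received by the switch manager."""
    if state not in EXPECTED:
        # Assumption: states 510 and 512 do not accept requests here.
        raise ValueError(f"no requests handled in {state.name}")
    expected, next_state = EXPECTED[state]
    if request == expected:
        return next_state
    # Unexpected request: disable change tracking from state 508,
    # otherwise undo the partial switch in the cleanup state.
    if state is State.WRITES_UNBLOCKED:
        return State.DISABLED
    return State.CLEANUP


def finish_cleanup() -> State:
    """After the cleanup operations complete, return to state 502."""
    return State.READY
```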

In one or more embodiments, situations can arise where the management service malfunctions (e.g., crashes or otherwise ceases normal operation) during log switching. To prevent such a malfunction from causing processing of new log queue entries to be blocked indefinitely (e.g., due to an unblock write requests response not being received from the management service because of the malfunction), the switch manager maintains a context for the management service when an initialize new log request is received. This context is identified as part of the initialize new log request, and is typically a handle that is opened by the management service or another identifier assigned by (or reported to) the operating system. If the management service malfunctions, any such handles or identifiers of the management service are closed by the operating system, and the switch manager is notified of such closures. Thus, if a handle maintained as the context for the management service by the switch manager is closed prior to a finalize old log request being received having that same handle, then the switch manager determines that the management service malfunctioned during the log switching. The switch manager proceeds to take appropriate remedial action (e.g., transition to new log cleanup state 510 and/or change tracking disabled state 512), including unblocking processing of new log queue entries. Thus, a malfunction in the management service will not cause processing of new log queue entries to be blocked indefinitely.

The techniques discussed herein support various different usagescenarios. By blocking processing of new log queue entries but allowingprocessing of VHD queue entries during log switching, the performanceimpact due to the log switching is reduced because the VHD queue entriescan continue to be processed. The processing of new log queue entriesthat is blocked can be writing of the log queue entries to memory ratherthan storage, as discussed above, so when the processing of new logqueue entries is unblocked the new log queue entries can be processedquickly relative to the writing of VHD queue entries to storage.

Furthermore, the techniques discussed herein allow the log switching to occur while maintaining request ordering for write order dependent requests. In some situations, storage access requests issued from virtual machines have particular ordering requirements. For example, an application of the virtual machine may use a write-ahead-logging (WAL) protocol in which one write request (e.g., a write to a database) to one VHD is not issued until confirmation of completion of another write request (e.g., a write to a log record maintained by the application) to another VHD is received. The techniques discussed herein allow log switching while maintaining such ordering constraints.

For example, assume that two write requests W1 followed by W2 are issued by a virtual machine, and that the order of the write requests is to be maintained (W2 is to be performed after W1). A response indicating completion of W1 is returned after W1 is written to both the VHD file and the log file, and in response to this indication the virtual machine issues W2. By blocking processing of a log queue entry for W1 while log switching, the write of W1 to the log file and thus the indication of completion of W1 is delayed until the log switching is completed. This blocking prevents a situation in which W1 and W2 are received after one log file is switched but before another log file is switched, resulting in W1 being written to a new log file (and thus not yet transferred to a recovery device) and W2 being written to an old log file (that is transferred to a recovery device as the log switching completes). Such a situation where W2 is transferred to a recovery device but W1 is not transferred would violate the request ordering for W1 and W2 in the recovery system, but is avoided using the techniques discussed herein.
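
The W1/W2 scenario can be traced with a small simulation. The sketch below assumes two logs, labeled "A" (receiving W1) and "B" (receiving W2), with log A switched before log B; the function name and the labels are hypothetical.

```python
def simulate(block_during_switch: bool):
    """Return (writes in old logs, writes in new logs) after a log switch."""
    old = {"A": [], "B": []}
    new = {"A": [], "B": []}
    current = {"A": old["A"], "B": old["B"]}
    pending = []  # log queue entries held while processing is blocked

    current["A"] = new["A"]  # log A is switched first

    # W1 arrives after log A is switched but before log B is switched.
    if block_during_switch:
        # W1's log entry is held, so W1 is not reported complete and the
        # virtual machine does not yet issue W2.
        pending.append(("A", "W1"))
    else:
        current["A"].append("W1")  # W1 lands in the NEW log A
        current["B"].append("W2")  # W2 lands in the OLD log B

    current["B"] = new["B"]  # log B is switched

    for log_name, write in pending:  # processing is unblocked
        current[log_name].append(write)
    if block_during_switch:
        current["B"].append("W2")  # issued only after W1 completes

    return old["A"] + old["B"], new["A"] + new["B"]
```

Without blocking, the old logs sent to the recovery device contain W2 but not W1. With blocking, both writes land in the new logs and their order is preserved.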

FIG. 6 is a flowchart illustrating an example process 600 forimplementing request ordering support when switching virtual diskreplication logs in accordance with one or more embodiments. Process 600is carried out, for example, by a system 100 of FIG. 1, a system 200 ofFIG. 2, and/or an architecture 300 of FIG. 3, and can be implemented insoftware, firmware, hardware, or combinations thereof. Process 600 isshown as a set of acts and is not limited to the order shown forperforming the operations of the various acts. Process 600 is an exampleprocess for implementing request ordering support when switching virtualdisk replication logs; additional discussions of implementing requestordering support when switching virtual disk replication logs areincluded herein with reference to different figures.

In process 600, storage access requests are received from a virtual machine (act 602). These storage access requests can be write requests and/or other requests as discussed above.

One of multiple virtual hard disks is updated as directed by the storage access request (act 604). The updating can be, for example, writing data to the virtual hard disk as discussed above. Each storage access request typically indicates one virtual hard disk that is to be updated, but can alternatively indicate multiple virtual hard disks that are to be updated.

Information associated with the storage access request is also stored in one of multiple logs (act 606). Each log (e.g., a log file), also referred to as a replication log, can correspond to one of the virtual hard disks as discussed above.

The multiple logs are switched while maintaining request ordering for write order dependent requests (act 608). This switching can be done in multiple parts and/or multiple stages as discussed above. As part of this switching, the old logs (the logs being switched from) can be transferred to a recovery device, as discussed above. Request ordering is maintained for write order dependent requests at least in part by blocking processing of the information associated with each storage access request, such as by blocking storing log request queue entries in the log, as discussed above.

III. Example Embodiments for Switching Replication Logs Used to Snapshot a Multi-Stream Application on Multiple Hosts

As described above, replication logs may be maintained and switched outto be used to update replica storage with changes that were made toprimary storage. In some cases, multiple storage instances (e.g.,virtual disks, physical disks, memory devices, etc.) may store data thatis related. For instance, multiple virtual machines may each operaterespective portions of a same distributed application, such that writesmade to their respective primary storage have a write order that needsto be maintained when applied to the corresponding replica storage. Inother words, an order of writes made by a first virtual machine to itsprimary storage and an order of writes made by a second virtual machineto its primary storage may need to be maintained with regard to the samewrites being made to replica storage, because the first and secondvirtual machines may communicate with each other, impacting the timingand contents of their respective writes to storage, thereby creating awrite order dependency issue.

Accordingly, the embodiments described in the preceding section may bemodified to coordinate the timing of the switching of replication logs,to maintain write order consistency. Such embodiments may be implementedin various ways. For instance, FIG. 7 shows a block diagram of a system700 that includes multiple virtual disks that store write orderdependent data, and that maintains write order dependency across virtualdisks, according to example embodiments. For example, system 700 may beincluded in a computer network, such as a computer cluster (connectedcomputers that work together) that implements distributed applicationsand incorporates a storage network, or any other computer network thatincludes multiple computing devices (e.g., computers, servers, etc.)that store interrelated data in storage.

As shown in FIG. 7, system 700 includes a computing device 702, acomputing device 704a, and a computing device 704b. Computing device 702includes a write order consistent tracking (WOCT) coordinator 706.Computing device 704a includes a first virtual machine (VM) 104a, afirst storage request processing module (SRPM) 108a, a first replicationmanagement module (RMM) 110a, a second VM 104b, a second SRPM 108b, asecond RMM 110b, and a first agent 708a. Computing device 704b includesa third VM 104c, a third SRPM 108c, a third RMM 110c, a fourth VM 104d,a fourth SRPM 108d, a fourth RMM 110d, and a second agent 708b. Thesefeatures/elements of system 700 are described as follows.

It is noted that two computing devices that each include two virtualmachines are shown in FIG. 7 for purposes of illustration. In otherembodiments, further numbers of computing devices may be present,including tens, hundreds, thousands, and greater numbers of computingdevices, and other numbers of virtual machines may be present, with eachcomputing device including one or more virtual machines. Furthermore,storage 710a, 710b, 710c, and 710d are physical storage devices, and mayinclude memory devices, hard disk drives, and/or other forms of physicalstorage. Still further, note that although WOCT coordinator 706 is shownin FIG. 7 in a computing device that is separate from computing devicescontaining virtual machines and agents, in another embodiment, WOCTcoordinator 706 may be in a same computing device with an agent and oneor more virtual machines.

First VM 104a, first SRPM 108a, and first RMM 110a are respective examples of VM 104, SRPM 108, and RMM 110 described in the preceding section. Similarly, second VM 104b, second SRPM 108b, and second RMM 110b, third VM 104c, third SRPM 108c, and third RMM 110c, and fourth VM 104d, fourth SRPM 108d, and fourth RMM 110d are all respective examples of VM 104, SRPM 108, and RMM 110. Furthermore, in a similar manner as described above, first VM 104a stores data in storage 710a in one or more VDs 106a through SRPM 108a, and one or more logs 112a corresponding to VDs 106a are generated by RMM 110a, and stored in storage 710a, to store storage access requests from first VM 104a for replication purposes. Similarly, second VM 104b uses second SRPM 108b and second RMM 110b to store data in VD 106b in storage 710b, and generate logs 112b that are stored in storage 710b, third VM 104c uses third SRPM 108c and third RMM 110c to store data in VD 106c in storage 710c, and generate logs 112c that are stored in storage 710c, and fourth VM 104d uses fourth SRPM 108d and fourth RMM 110d to store data in VD 106d in storage 710d, and generate logs 112d that are stored in storage 710d. Because these features of FIG. 7 are described elsewhere herein (e.g., the preceding section), this description is not provided again in full in this section for purposes of brevity.

In embodiments, WOCT coordinator 706 in computing device 702 isconfigured to coordinate the switching of replication logs acrosscomputing devices 704a, 704b, etc., to maintain write order consistency.For instance, WOCT coordinator 706 may communicate with agents atcomputing devices that contain virtual machines, such as agents 708a and708b. WOCT coordinator 706 may instruct the agents to initiate logswitching for all of the virtual machines at their respective computingdevices, and to provide the resulting old logs (the logs switched out)to WOCT coordinator 706 or elsewhere to be applied to replica storage.

For example, in an embodiment, WOCT coordinator 706 may operateaccording to FIG. 8. FIG. 8 shows a flowchart 800 providing a processfor the switching of virtual disk replication logs in a manner thatmaintains write order dependency across virtual disks, according to anexample embodiment. Flowchart 800 is described as follows with respectto FIG. 7 and FIG. 9. FIG. 9 shows a block diagram of a WOCT coordinator900, according to an example embodiment. WOCT coordinator 900 is anexample of WOCT coordinator 706 of FIG. 7. Further structural andoperational embodiments will be apparent to persons skilled in therelevant art(s) based on the following description.

Flowchart 800 begins with step 802. In step 802, a cycle of a log switching of a plurality of logs associated with a plurality of virtual disks at a plurality of computing devices is initiated, the virtual disks storing data that is write order dependent amongst the virtual disks. For example, in an embodiment, log switching initiator 902 of WOCT coordinator 900 (FIG. 9) may initiate a cycle of log switching for instances of storage at computing devices. For example, in an embodiment, log switching initiator 902 may transmit an instruction to the agents at the computing devices through a network (e.g., a local area network, a wide area network, a combination of networks such as the Internet, a storage area network, etc.), and/or may initiate the log switching cycle in another way. Further example embodiments for initiating a cycle of log switching according to step 802 are described below.

In step 804, the cycle of the log switching of the plurality of logs at the plurality of computing devices is coordinated across the virtual disks to maintain request ordering for write order dependent requests. In an embodiment, log switching manager 904 of WOCT coordinator 900 (FIG. 9) may be configured to coordinate the cycle of log switching initiated by log switching initiator 902. Log switching manager 904 may be configured to coordinate one or more stages of the cycle of log switching by communicating with the agents, such that each stage is performed and confirmed by each agent before enabling the next stage to be performed. For instance, in an embodiment, log switching 902 may transmit instructions or control codes, may use exclusive locks, and/or may use other techniques to coordinate the log switching cycle. Further example embodiments for coordinating a cycle of log switching according to step 804 are described below.
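
The stage-by-stage coordination of step 804 can be sketched as a barrier: the next stage is enabled only after every agent has confirmed the current stage. The Agent class, its perform method, and the return convention of run_cycle are hypothetical and chosen for illustration.

```python
class Agent:
    """Hypothetical agent that performs a stage and confirms it."""

    def __init__(self, name, events, fail_at=None):
        self.name = name
        self.events = events    # shared record of (stage, agent) events
        self.fail_at = fail_at  # stage at which this agent fails, if any

    def perform(self, stage) -> bool:
        if stage == self.fail_at:
            return False  # no confirmation for this stage
        self.events.append((stage, self.name))
        return True


def run_cycle(agents, stages=(412, 414, 416, 418)):
    """Run one log switching cycle; return the failed stage or None."""
    for stage in stages:
        # Every agent performs the stage before the next stage is enabled.
        acks = [agent.perform(stage) for agent in agents]
        if not all(acks):
            return stage
    return None
```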

Accordingly, in embodiments, log switching initiator 902 of WOCTcoordinator 900 may initiate a cycle of log switching (step 802) invarious ways. For instance, FIG. 10 shows a flowchart 1000 providing aprocess for initiating log switching, according to an exampleembodiment. Log switching initiator 902 may operate according toflowchart 1000 in an embodiment. Flowchart 1000 is described as followswith respect to FIG. 7, FIG. 9, and FIGS. 11 and 12. FIGS. 11 and 12show block diagrams of a system 1100 of using lock files to coordinatelog switching, according to example embodiments. FIGS. 11 and 12 eachshow log switching initiator 902, log switching manager 904, agent 708a,agent 708b, and storage 1102 that is accessible by each of log switchinginitiator 902, log switching manager 904, agent 708a, agent 708b.Storage 1102 includes first-fourth begin stage lock files 1106a-1106dand first-fourth end stage lock files 1108a-1108d. Further structuraland operational embodiments will be apparent to persons skilled in therelevant art(s) based on the following description.

Flowchart 1000 begins with step 1002. In step 1002, an exclusive lock is taken on each of a plurality of begin stage lock files, each begin stage lock file associated with a corresponding stage of a plurality of stages of the cycle of log switching. In embodiments, a cycle of log switching may be performed in any number of stages. For example, log switching may be performed for a virtual machine according to FIG. 4, where four stages 412, 414, 416, and 418 are used in a cycle. In other embodiments, other numbers of stages may be used. A variety of mechanisms may be used to control/coordinate switching from one stage to another. For example, in an embodiment, one or more lock files may be used at each stage to coordinate stages of log switching. A lock file is a file whose content may be irrelevant (other than an identifier of a holder of the lock in the file, etc.), but is used to signal that a resource is locked. In embodiments, whether a lock file is locked or not may be an indicator of whether a stage may begin, whether a coordinator or process has performed its functions in a stage, or whether some other task related to a stage has been or can be performed. In one example embodiment, each stage may have a corresponding begin stage lock file and an end stage lock file, as further described below.

For instance, as shown in FIG. 11, begin stage lock file 1106a and end stage lock file 1108a are first stage lock files 1104a that may be associated with a first stage (e.g., first stage 412). Likewise, begin stage lock file 1106b and end stage lock file 1108b are second stage lock files 1104b that may be associated with a second stage (e.g., second stage 414), begin stage lock file 1106c and end stage lock file 1108c are third stage lock files 1104c that may be associated with a third stage (e.g., third stage 416), and begin stage lock file 1106d and end stage lock file 1108d are fourth stage lock files 1104d that may be associated with a fourth stage (e.g., fourth stage 418).

In the example of FIG. 11, at the outset, begin stage lock files 1106a-1106d and end stage lock files 1108a-1108d may have no locks placed on them by agents or coordinators. In an embodiment, log switching initiator 902 may verify that no locks are taken on begin stage lock files 1106a-1106d and end stage lock files 1108a-1108d (e.g., no lock entries in the lock files by agents, etc.). Log switching initiator 902 takes an exclusive lock on each of begin stage lock files 1106a-1106d, as represented by exclusive locks 1110a-1110d (e.g., writes an exclusive lock entry to the lock files, etc.). In this manner, agents 708a, 708b, and any other agents that are present, are prevented from taking a lock on begin stage lock files 1106a-1106d, which indicates to the agents that they are not yet to perform their functions for any stage.

Referring back to FIG. 10, in step 1004, a log switching initiation instruction is transmitted to a plurality of agents at the computing devices. For example, as shown in FIG. 11, log switching initiator 902 may transmit a log switching initiation instruction 1116. Log switching initiation instruction 1116 may be transmitted through a network as described elsewhere herein to be received by agents 708a, 708b, etc. Log switching initiation instruction 1116 indicates to agents 708a, 708b, etc. that a cycle of log switching is to commence, and indicates that agents 708a, 708b, etc. should prepare for a cycle of log switching, and provide a response to indicate readiness for the log switching.

Accordingly, upon receipt of log switching initiation instruction 1116, agents 708a, 708b, etc. each prepare for log switching. The agents may perform one or more preparatory processes for log switching. For instance, as shown in FIG. 11, each agent 708a, 708b, etc., may take a shared lock on each of end stage lock files 1108a-1108d, as represented by shared locks 1112a-1112d taken by agent 708a (e.g., writes a shared lock entry to the lock files, etc.), shared locks 1114a-1114d taken by agent 708b, etc.

In step 1006, a response is received from each of the agents, each response received from an agent of the plurality of agents indicating that the agent took a shared lock on each of a plurality of end stage lock files, each end stage lock file associated with a corresponding stage of the plurality of stages. In an embodiment, as shown in FIG. 12, after taking the shared locks, agents 708a, 708b, etc. may transmit a corresponding readiness response 1202a, 1202b, etc. Readiness responses 1202a, 1202b, etc., may be transmitted through a network as described elsewhere herein to be received by log switching initiator 902. In this manner, agents 708a, 708b, and any other agents that are present, indicate their readiness to log switching initiator 902 to perform their functions for each stage.

It is noted that if an agent is down, if the agent is unable to take all of the shared locks on the end stage lock files, or if there is another problem with the agent, the agent may not transmit its readiness response and/or the readiness response may not be received by log switching initiator 902. Log switching initiator 902 may be configured in various ways to handle the situation where a readiness response is not received from one or more agents. For instance, log switching initiator 902 may be configured to abort the cycle of log switching, and may transmit an abort command to the agents. In another embodiment, log switching initiator 902 may decide to continue the cycle of log switching without the agent. A result of this may be that the particular non-responsive agent does not cause the switching of logs for the virtual machines at its computing device. This may be acceptable where it is presumed that the agent can catch up during a subsequent cycle of log switching, and/or based on any other suitable consideration.
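As an illustration only, the initiation process of flowchart 1000 may be sketched with an in-memory model of lock files. The class `LockFile`, the function `initiate_cycle`, and the holder name "initiator" are hypothetical names introduced for this sketch. The transmission of the initiation instruction and of the readiness responses is modeled as a plain loop over agent names rather than as network traffic, and an agent that cannot take every shared lock is simply treated as not ready.

```python
class LockFile:
    """In-memory stand-in for a lock file (hypothetical model, not a real file)."""

    def __init__(self, name):
        self.name = name
        self.exclusive = None   # holder of the exclusive lock, if any
        self.shared = set()     # holders of shared locks

    def take_exclusive(self, holder):
        # An exclusive lock can be taken only when no other lock is held.
        if self.exclusive is not None or self.shared:
            return False
        self.exclusive = holder
        return True

    def take_shared(self, holder):
        # A shared lock is blocked only by an existing exclusive lock.
        if self.exclusive is not None:
            return False
        self.shared.add(holder)
        return True

    def release(self, holder):
        if self.exclusive == holder:
            self.exclusive = None
        self.shared.discard(holder)


def initiate_cycle(begin_files, end_files, agents, initiator="initiator"):
    """Sketch of steps 1002-1006; returns the set of agents that reported ready."""
    # Step 1002: take an exclusive lock on every begin stage lock file.
    for lock_file in begin_files:
        if not lock_file.take_exclusive(initiator):
            raise RuntimeError(f"{lock_file.name} is already locked")
    ready = set()
    # Step 1004: the initiation instruction is modeled as a loop over agents.
    for agent in agents:
        # Agent side: take a shared lock on every end stage lock file.
        if all(lock_file.take_shared(agent) for lock_file in end_files):
            ready.add(agent)    # Step 1006: readiness response
        else:
            # Assumption: an agent that cannot take all locks gives them back.
            for lock_file in end_files:
                lock_file.release(agent)
    return ready
```

In this sketch, a caller that wishes to continue without a non-responsive agent would simply proceed with the returned set, while a caller that wishes to abort would compare the returned set against the full list of agents.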

Accordingly, in the manner of flowchart 1000, log switching initiator 902 of WOCT coordinator 900 may initiate a cycle of log switching (step 802). As described above, log switching manager 904 of WOCT coordinator 900 may be configured to coordinate a cycle of log switching (step 804 of FIG. 8). For instance, FIG. 13 shows a flowchart 1300 providing a process for coordinating a stage of log switching, according to an example embodiment. Log switching manager 904 may operate according to flowchart 1300 in an embodiment. Flowchart 1300 is described as follows with respect to FIG. 9 and FIGS. 11 and 12. Further structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the following description.

Flowchart 1300 begins with step 1302. In step 1302, the exclusive lock is released on the associated begin stage lock file to signal the beginning of the stage to the agents. In an embodiment, to signal the beginning of a stage to agents 708a, 708b, log switching manager 904 may release the exclusive lock on the begin stage lock file for the stage. For instance, with respect to FIG. 11, the first stage (e.g., first stage 412 of FIG. 4) may be desired to be performed. In such case, log switching manager 904 may release exclusive lock 1110a on begin stage lock file 1106a. Log switching manager 904 may release the exclusive lock directly (e.g., by removing an exclusive lock entry from the begin stage lock file), or may instruct log switching initiator 902 to release the lock.

Once the exclusive lock is released on the begin stage lock file of a stage, this signals to the agents that the functions of the stage may be performed. For instance, after initiation of the cycle of log switching (e.g., step 802 in FIG. 8), agents 708a, 708b, etc., may check begin stage lock files 1106a-1106d periodically to determine whether each stage has begun. When an agent determines that an exclusive lock is not present on a begin stage lock file, this indicates to the agent that the stage has begun. In such case, the agent may have the functions of the stage performed with respect to the log files associated with its virtual machines.

For example, if the current stage is the first stage, as shown in FIG. 12, log switching manager 904 may cause exclusive lock 1110a (of FIG. 11) on begin stage lock file 1106a to be released, and exclusive lock 1110a is therefore not shown in FIG. 12. Agents 708a, 708b, etc. may determine that exclusive lock 1110a has been released, and thus that the first stage may be performed. For instance, with respect to FIG. 4, first stage 412 may be performed, where new logs may be initialized as described above. If the stage is the second stage, the agents may determine that exclusive lock 1110b on begin stage lock file 1106b has been released, and that the second stage (e.g., second stage 414 of FIG. 4) may be performed. If the stage is the third stage, the agents may determine that exclusive lock 1110c on begin stage lock file 1106c has been released, and that the third stage (e.g., third stage 416 of FIG. 4) may be performed. If the stage is the fourth stage, the agents may determine that exclusive lock 1110d on begin stage lock file 1106d has been released, and that the fourth stage (e.g., fourth stage 418 of FIG. 4) may be performed.

From step 1302 of FIG. 13, operation proceeds to step 1304.

In step 1304, an exclusive lock is attempted to be taken on the associated end stage lock file, the exclusive lock enabled to be taken when the agents have released all shared locks on the associated end stage lock file to signify completion of the stage by the agents. In an embodiment, when each agent has confirmed that a current stage has been completed for the virtual machines at its computing device, the agent releases its shared lock on the end stage lock file for that stage (e.g., removes the corresponding entry from the lock file). When all of the agents have released their shared locks on the end stage lock file for that stage (e.g., all shared lock entries removed from the end stage lock file), log switching manager 904 is enabled to take an exclusive lock on the end stage lock file, indicating the stage as completed.

For example, with reference to FIG. 12, log switching manager 904 attempts to take an exclusive lock on end stage lock file 1108a. If any shared locks are maintained on end stage lock file 1108a, log switching manager 904 cannot take an exclusive lock on end stage lock file 1108a. As shown in FIG. 12, agent 708a has completed the first stage, and thus releases shared lock 1112a (shown in FIG. 11) on end stage lock file 1108a. Agent 708b has not yet completed the first stage, so shared lock 1114a is still present on end stage lock file 1108a, and log switching manager 904 still cannot take the exclusive lock. When shared lock 1114a is released by agent 708b, and any further shared locks on end stage lock file 1108a are released by any further agents, log switching manager 904 is enabled to take an exclusive lock on end stage lock file 1108a, shown as exclusive lock 1204.

From step 1304 of FIG. 13, operation proceeds to step 1306.

In step 1306, whether the exclusive lock of step 1304 was taken is determined. If the exclusive lock is able to be taken on the end stage lock file for the stage by log switching manager 904, the agents have signaled that they have completed the functions of the stage, and operation proceeds to step 1308. If the exclusive lock is not able to be taken on the end stage lock file for the stage by log switching manager 904, one or more of the agents have not completed the functions of the stage, and operation proceeds back to step 1304.

In step 1308, whether the current stage is the last stage of the log switching cycle is determined. If the current stage is the last stage of the log switching cycle (e.g., fourth stage 418 of the four stage process 400 of FIG. 4), operation proceeds to step 1312. If the current stage is not the last stage of the log switching cycle, operation proceeds to step 1310.

In step 1310, the next stage is transitioned to. When the current stage is completed, operation proceeds to step 1302, where log switching manager 904 initiates the next stage of the log switching cycle.

In step 1312, the log switching cycle is complete. When all stages of the log switching cycle have been performed, the log switching cycle is complete.
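The per-stage loop of flowchart 1300 may be sketched, purely as an illustration, with a deterministic simulation. The function `coordinate_cycle` and its parameter `agent_ticks` (the number of simulated polling rounds each agent needs to finish a stage) are hypothetical, and the lock files are modeled as a list of flags and a list of sets rather than as files on shared storage.

```python
def coordinate_cycle(num_stages, agent_ticks):
    """Simulate flowchart 1300 and return a trace of coordinator events.

    agent_ticks maps an agent name to the number of polling rounds that the
    agent needs to finish any one stage (an assumption made for this sketch).
    """
    # State left by the initiation process: the coordinator holds an exclusive
    # lock on every begin file, and every agent holds a shared lock on every
    # end file.
    begin_exclusive = [True] * num_stages
    end_shared = [set(agent_ticks) for _ in range(num_stages)]
    trace = []
    for stage in range(num_stages):          # steps 1308/1310: one pass per stage
        begin_exclusive[stage] = False       # step 1302: signal that the stage begins
        trace.append(("begin", stage))
        remaining = dict(agent_ticks)
        attempts = 0
        while True:
            attempts += 1                    # step 1304: try to take the exclusive lock
            if not end_shared[stage]:        # step 1306: no shared locks remain
                trace.append(("end", stage, attempts))
                break
            # One polling round: agents that see the begin lock released do
            # their work and release their shared lock when they are finished.
            for agent in list(end_shared[stage]):
                if not begin_exclusive[stage]:
                    remaining[agent] -= 1
                    if remaining[agent] <= 0:
                        end_shared[stage].discard(agent)
    trace.append(("complete",))              # step 1312
    return trace
```

With two stages and agents that need one and three rounds respectively, the coordinator in this sketch succeeds in taking the end stage lock on its fourth attempt in each stage, because the slower agent holds its shared lock through the first three attempts.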

Accordingly, in the manner of flowchart 1300, log switching manager 904 of WOCT coordinator 900 may coordinate a cycle of log switching (step 804 of FIG. 8). As described above, log switching manager 904 of WOCT coordinator 900 may be configured to coordinate a cycle of log switching in other ways, such as through the use of control codes and/or types of messages other than control codes. For instance, FIG. 14 shows a step 1402 for using control codes to coordinate log switching, according to an example embodiment. Step 1402 is an example embodiment for step 804 of FIG. 8. In step 1402, control codes are transmitted to a plurality of agents at the computing devices to enact the plurality of stages. A WOCT coordinator may be configured to use control codes to coordinate stages of a log switching process in any manner.

For instance, with reference to FIG. 7, system 700 may be a cluster computing system or network. A cluster computing network includes a set of loosely connected or tightly connected nodes/computers (e.g., computing devices 702, 704a, 704b, etc.) that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks (“LAN”), which may be referred to as a dedicated cluster communication network, with each node running its own instance of an operating system. Computer clusters are enablers for high performance distributed computing. “High-availability clusters” (also known as failover clusters, or HA clusters) are a type of computer cluster that includes redundant nodes, which are then used to provide service when system components fail. In a computer cluster, a heartbeat network may be present that is a private network shared by the cluster nodes, and used so that the cluster nodes can monitor the status of each other, and communicate with each other (e.g., using control codes or “cluster codes”). According to the heartbeat mechanism, every node sends a message (a “heartbeat”) at a given interval, referred to as a delta, to confirm that the node is alive. A receiver node called a “sink” maintains an ordered list of the messages. Once a message with a timestamp later than a marked time is received from every node, the system determines that all of the nodes are functioning.
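The liveness check performed by the sink may be illustrated with a short sketch. The function `all_nodes_alive` and its parameters are hypothetical. The sketch assumes that the sink keeps only the latest heartbeat timestamp per node, which is a simplification of the ordered list of messages described above.

```python
def all_nodes_alive(expected_nodes, last_heartbeat, marked_time):
    """Return True if every expected node sent a heartbeat after marked_time.

    last_heartbeat maps a node name to the timestamp of its most recent
    heartbeat; a node that has never reported is treated as not alive.
    """
    return all(
        last_heartbeat.get(node, float("-inf")) > marked_time
        for node in expected_nodes
    )
```

For example, with a marked time of 10.0, nodes whose latest heartbeats carry timestamps 10.5 and 11.0 are all considered functioning, whereas a node whose latest heartbeat is at 9.0, or that is missing from the mapping, causes the check to fail.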

Accordingly, in an embodiment, WOCT coordinator 706 may be configured to communicate with agents 708a, 708b, etc. using control codes over a heartbeat network. In other words, as shown in FIG. 11, log switching initiation instruction 1116 may be transmitted by log switching initiator 902 to agents 708a, 708b, etc. over a heartbeat network of a cluster network. Furthermore, responses 1202a, 1202b, etc. may be transmitted by agents 708a, 708b, etc. to log switching initiator 902 over the heartbeat network. Still further, log switching initiator 902 and agents 708a, 708b, etc. may communicate with each other over the heartbeat network using control codes and/or messages to coordinate performance of the stages of a cycle of log switching, rather than using lock files (as in FIGS. 10-12, and related text herein).

For example, FIG. 15 shows a flowchart 1500 providing a process for using control codes to coordinate a stage of log switching, according to an example embodiment. In an embodiment, log switching manager 904 (FIG. 9) may perform flowchart 1500 (e.g., to perform step 1402 of FIG. 14). Flowchart 1500 is described as follows with respect to FIG. 7. Further structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the following description.

Flowchart 1500 starts with step 1502. In step 1502, a time period for performing the log switching is initiated. Steps 1502 and 1508 are optional. In an embodiment, log switching manager 904 may maintain a predetermined time period that is a length of time by which a full cycle of log switching is to be performed, or else the cycle is aborted (no log switching occurs). The time period may be preconfigured to have any length of time suitable for a particular network configuration (e.g., 100 microseconds, etc.). Operation proceeds from step 1502 to step 1504.

In step 1504, a control code is transmitted to the plurality of agents. In step 1504, a control code may be transmitted by log switching manager 904 to agents 708a, 708b, etc. over the cluster network. The control code is recognized by the agents to signify a start to a stage. In one embodiment, a same control code is used to initiate all stages. In another embodiment, each stage may have its own control code configured to initiate the stage at the agents.

For instance, the first stage (e.g., first stage 412 of FIG. 4) may be desired to be performed. In such case, log switching manager 904 may transmit a control code to agents 708a, 708b, etc. When the control code is received by the agents, this signals to the agents that the functions of the first stage may be performed. In such case, the agents may have the functions of the first stage performed with respect to the log files associated with their virtual machines. In a similar manner, log switching manager 904 may transmit a control code to agents 708a, 708b, etc. to signal to the agents that the functions of the second stage, third stage, fourth stage, etc. may be performed.

In step 1506, a response to the transmitted control code is awaited from each of the plurality of agents. In an embodiment, log switching manager 904 may await a response to the transmitted control code from each of agents 708a, 708b, etc. Agents 708a, 708b, etc. may transmit the responses in any form (e.g., as response control codes, etc.) to log switching manager 904 through the cluster network. When responses from all agents are received by log switching manager 904 indicating success in performing the stage at the various computing devices, operation proceeds to step 1508. If a response is not received from an agent, this may indicate a failure to perform a stage at the corresponding computing device, or may indicate some other failure (e.g., a communication failure, agent going down, etc.). In such case, operation may optionally proceed to step 1510 where the log switching cycle is aborted, or operation may proceed to step 1508, with logs at computing devices of any non-responsive agents (and/or agents responding with stage failures) potentially not being switched during the current log switching cycle.

In step 1508, whether the time period has expired before all responses to the transmitted control codes are received is determined. As indicated in step 1502 above, step 1508 is optional. In an embodiment, during performance of a log switching cycle, log switching manager 904 may periodically check whether the time period initiated in step 1502 has expired. If the time period has expired, operation proceeds to step 1510. If the time period has not expired, operation proceeds to step 1512.

In step 1510, the log switching is aborted if at least one of the agents does not respond with the awaited response within a predetermined time period. In an embodiment, step 1510 may be performed by log switching manager 904, to end the log switching cycle without any logs being switched. Operation of flowchart 1500 completes after step 1510.

In step 1512, whether the current stage is the last stage of the log switching cycle is determined. If the current stage is the last stage of the log switching cycle (e.g., fourth stage 418 of the four stage process 400 of FIG. 4), operation proceeds to step 1516. If the current stage is not the last stage of the log switching cycle, operation proceeds to step 1514.

In step 1514, a next stage is transitioned to for enactment. When the current stage is completed, operation proceeds to step 1504, where log switching manager 904 initiates the next stage of the log switching cycle.

In step 1516, the log switching cycle is complete. When all stages of the log switching cycle have been performed (within the optional time period), the log switching cycle is complete.
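As a minimal sketch of flowchart 1500, the control code exchange may be modeled as a callback. The function `run_cycle_with_codes`, the callback `send` (which stands in for transmitting a control code to an agent and awaiting its response), and the injectable `clock` are hypothetical names introduced here. The sketch adopts the abort option when any agent fails to report success, and it checks the optional time period once per stage after the responses are collected.

```python
import time


def run_cycle_with_codes(num_stages, agents, send, timeout_s=None,
                         clock=time.monotonic):
    """Sketch of flowchart 1500; returns "complete" or "aborted".

    send(agent, stage) must return True when the agent reports that it
    performed the stage successfully (a stand-in for a response control code).
    """
    # Step 1502 (optional): start the time period for the whole cycle.
    deadline = None if timeout_s is None else clock() + timeout_s
    for stage in range(num_stages):
        # Steps 1504 and 1506: transmit the control code and await responses.
        responses = {agent: send(agent, stage) for agent in agents}
        if not all(responses.values()):
            return "aborted"        # step 1510 (the abort option)
        # Step 1508 (optional): check whether the time period has expired.
        if deadline is not None and clock() > deadline:
            return "aborted"        # step 1510
        # Steps 1512/1514: the loop advances to the next stage, if any.
    return "complete"               # step 1516
```

A variant that continues without non-responsive agents, as also contemplated above, would record the failed agents instead of returning early.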

It is noted that although lock files and control codes are presented as example techniques for coordinating log switching, other techniques for coordinating log switching may become apparent to persons skilled in the relevant art(s) from the teachings herein, which are encompassed as embodiments. Furthermore, embodiments may be combined in any manner. For instance, in an embodiment, a WOCT coordinator may implement both the lock file technique (e.g., FIGS. 10-13) and the control code technique (e.g., FIGS. 14 and 15) simultaneously in a computer network. In such an embodiment, for each stage, the approach that works more quickly (e.g., an indication that a stage is complete is provided more quickly) can be used to move to the next stage more quickly. For instance, in some situations, the lock file approach may finish a stage (e.g., agents releasing shared locks on the end stage lock file) faster than a control code approach is able to finish the stage (e.g., agents responding to a received control code). In other situations, the control code approach may be able to finish the stage faster than the lock file approach is able to finish a stage.

Accordingly, according to embodiments, log switching is enhanced to achieve write order preservation across multiple servers by introducing synchronization between change tracking mechanisms in different servers. To have minimal impact on servers, the synchronization is achieved without pausing VM operation, and without having to perform IO during synchronization, and is finished within relatively short amounts of time. The synchronization is performed in a manner that is not continuous, but is performed at particular time intervals as specified by a desired product replication frequency. In an embodiment, the synchronization success rate may only be limited by the speed of communication between servers and the timeouts specified by the product. Accordingly, a generic framework is defined to enable multiple communication channels between servers to achieve synchronization as quickly as possible.

IV. Example Embodiments for Replication of a Multi-Stream Application Based on Replication Logs

As described above, embodiments are provided for replication of a multi-stream application (e.g., an application that generates multiple separate streams of data, which may be stored separately). According to embodiments, the stored data of an application may be replicated and maintained in sync with the primary stored data, by applying the switched out logs generated in the prior section to replica storage at particular times. Such embodiments may be implemented in various ways.

For example, FIG. 16 shows a block diagram of a system 1600 that includes replication coordinators to coordinate log switching and the application of virtual disk replication logs to replica storage, according to example embodiments. As shown in FIG. 16, system 1600 is similar to FIG. 7, including computing device 702 and computing device 704a (computing device 704b, and any further computing devices, are not shown in FIG. 16 for ease of illustration). Computing device 702 includes WOCT coordinator 706 as in FIG. 7, and computing device 704a is configured as shown in FIG. 7. In an embodiment, WOCT coordinator 706 may coordinate switching of logs at computing device 704a by communicating with agent 708a, and switching of logs at further computing devices by communicating with corresponding agents, as described elsewhere herein. Furthermore, in FIG. 16, system 1600 includes a computing device 1602, a computing device 1604a (and optional further computing devices), a replica storage 1614a, and a replica storage 1614b. Still further, computing device 702 includes a first replication coordinator 1606, computing device 1602 includes a second replication coordinator 1608, computing device 1604a includes a log file processing agent 1612a, replica storage 1614a includes at least one virtual disk 1616a, and replica storage 1614b includes at least one virtual disk 1616b.

In FIG. 16, computing device 702, computing device 704a, storage 710a, and storage 710b are considered primary side or primary-site components as indicated by primary side 1624, and computing device 1602, computing device 1604a, replica storage 1614a, and replica storage 1614b are considered secondary side, secondary-site, or replica side components as indicated by replica side 1626. This is because replica storage 1614a is replica storage for storage 710a, with virtual disk(s) 1616a being a replica of virtual disk(s) 106a, and replica storage 1614b is replica storage for storage 710b, with virtual disk(s) 1616b being a replica of virtual disk(s) 106b. Each storage instance associated with a virtual machine at a computing device on primary side 1624 has a counterpart replica storage on replica side 1626. Furthermore, computing device 1604a (e.g., a server, etc.) is the replica side counterpart to computing device 704a.

In an embodiment, first and second replication coordinators 1606 and 1608 work together to replicate data on primary side 1624 to replica side 1626 using the replication logs generated according to the techniques described elsewhere herein. Accordingly, multiple storage instances (e.g., virtual disks) that store related data may be replicated to replica side 1626 simultaneously, which assists in maintaining write order consistency.

First and second replication coordinators 1606 and 1608 may operate in various ways to perform their functions. For instance, FIG. 17 shows a flowchart 1700 providing a process for coordinating log switching and the application of virtual disk replication logs to replica storage, according to an example embodiment. In embodiments, first replication coordinator 1606 may operate according to flowchart 1700, second replication coordinator 1608 may operate according to flowchart 1700, or replication coordinators 1606 and 1608 may cooperate to perform flowchart 1700. Flowchart 1700 is described as follows with respect to FIG. 16. Further structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the following description.

Flowchart 1700 begins with step 1702. In step 1702, an instruction is transmitted to perform a cycle of log switching of a plurality of logs associated with a first plurality of virtual disks at a plurality of computing devices. For example, as shown in FIG. 16, replication coordinator 1608 at computing device 1602 (replica side 1626) may generate a log switching instruction 1628 that is received over a network by replication coordinator 1606 at computing device 702 (primary side 1624). Log switching instruction 1628 is an instruction to perform log switching at the various computing devices containing storage associated with tracking logs. Log switching instruction 1628 may be transmitted in any manner, such as a control code (e.g., over a cluster network), an HTTP request (e.g., over a LAN, WAN, etc.) or in another form.

As shown in FIG. 16, in response to receiving log switching instruction 1628, replication coordinator 1606 may generate a second log switching instruction 1618, which is received by WOCT coordinator 706. Log switching instruction 1618 instructs WOCT coordinator 706 to perform a cycle of log switching. In response to receiving log switching instruction 1618, WOCT coordinator 706 may cause a cycle of log switching to be performed in any manner described herein, such as according to flowchart 800 (FIG. 8), etc.

Note that in another embodiment, replication coordinator 1606 may generate log switching instruction 1618 without having received log switching instruction 1628 from replication coordinator 1608. In embodiments, replication coordinator 1606 and/or replication coordinator 1608 may cause a cycle of log switching to be performed (e.g., by generating a log switching instruction) at any time, which may be periodically, at random times, at predetermined times (e.g., according to a schedule), when VHD request queues 322 and/or log request queues 342 (FIG. 3) are becoming full, based on an amount of storage traffic (e.g., perform log switching more frequently when data storage events are occurring more often), and/or in any other manner.

In step 1704, a plurality of logs is received from the computing devices in response to performance of the cycle of log switching. For instance, as shown in FIG. 16, replication coordinator 1608 at computing device 1602 (replica side 1626) receives replication log(s) 1622a from computing device 704a, replication log(s) 1622b from computing device 704b (FIG. 7; not shown in FIG. 16), and may receive further replication logs from further computing devices on primary side 1624. The replication logs are the logs that were switched out due to performance of log switching as described elsewhere herein, in response to step 1702. Accordingly, the replication logs relate to data stored in primary storage, across multiple computing devices and virtual disks, and that may need write order dependency maintained storage-wide.

The replication logs may be received from agents at the computing devices (e.g., agent 708a, etc.), from log request processing modules 344 at the computing devices, and/or from other sources at the computing devices. In such an embodiment, the replication logs are received directly and individually from the computing devices where the logs were generated and switched out, through multiple channels, rather than collecting the replication logs at one point. This may enable the replication logs to be provided to replication coordinator 1608 (replica side 1626) faster than collecting the replication logs at replication coordinator 1606 (primary side 1624) and then passing them to replication coordinator 1608, although this may be done in an alternative embodiment. Receiving the logs from the individual computing devices enables greater scalability for system 1600.

Referring back to FIG. 17, in step 1706, each log of the received plurality of logs is tagged to at least indicate the cycle of log switching. In an embodiment, replication coordinator 1608 may include a log file tagger 1610. Log file tagger 1610 is configured to tag each received replication log at least with information that identifies the particular cycle of log switching (e.g., with a cycle identifier/code). In this manner, the replication logs of a particular cycle may be applied to replica storage at a same time to enable write order to be maintained. Log file tagger 1610 may tag the replication logs in any manner, such as by providing an indication of the log switching cycle in a header of the log file, in the body of the log file, as metadata associated with the log file, in a file name of the log file, and/or in any other manner.

Note that in another embodiment, each agent 708a, etc. may include a log file tagger 1610 that tags replication logs 1622a, 1622b, etc. prior to being transmitted from primary side 1624. In still another embodiment, each computing device 704a, etc. may include a log file tagger 1610 that is separate from the corresponding agent 708a, etc. at the computing device.
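The tagging of step 1706 may be illustrated with a short sketch in which the cycle identifier is carried as metadata alongside the log contents. The class `ReplicationLog`, its fields, and the function `tag_logs` are hypothetical and are not taken from any described implementation; tagging in a header, in the log body, or in a file name would serve equally well.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ReplicationLog:
    """Hypothetical in-memory view of a switched-out replication log."""
    source_host: str                  # computing device that produced the log
    disk_id: str                      # virtual disk the log applies to
    payload: bytes                    # recorded storage access requests
    cycle_id: Optional[int] = None    # log switching cycle, once tagged


def tag_logs(logs, cycle_id):
    """Return copies of the logs tagged with the given log switching cycle."""
    return [replace(log, cycle_id=cycle_id) for log in logs]
```

Because the dataclass is frozen, the received logs are left unchanged and tagged copies are returned, which keeps untagged and tagged versions distinguishable in this sketch.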

In step 1708, the tagged plurality of logs is provided to enable a write-order consistent storage point in a second plurality of virtual disks. In an embodiment, as shown in FIG. 16, replication coordinator 1608 may transmit tagged replication logs 1624, which include versions of replication logs 1622a, 1622b, etc. that have been tagged with log switching cycle identifiers. In an embodiment, log file processing agent 1612a at computing device 1604a, and further log file processing agents at further computing devices, may receive tagged replication logs 1624. Tagged replication logs 1624 may be transmitted to the log file processing agents in any manner, such as being transmitted over a cluster network, over a LAN, WAN, etc., or in another form.

In an embodiment, each computing device receives one or more tagged replication logs of tagged replication logs 1624 that is/are applicable to the replica storage associated therewith. For instance, in the example of FIG. 16, computing device 1604a may receive tagged versions of replication logs 112a and 112b switched out from computing device 704a, in the case where virtual disk(s) 1616a of replica storage 1614a correspond to virtual disk(s) 106a of storage 710a, and virtual disk(s) 1616b of replica storage 1614b correspond to virtual disk(s) 106b of storage 710b.

In an embodiment, log file processing agents 1612a, etc. at the respective computing devices apply the storage access requests included in the received tagged replication logs to the corresponding virtual disks in virtual storage. Replication coordinator 1608 coordinates applying of the replication logs such that replication logs that are tagged with the same cycle are applied by log file processing agents 1612a, etc. in parallel. Replication coordinator 1608 may require that log file processing agents 1612a, etc. all transmit a confirmation/response to replication coordinator 1608 that each of their replication logs was successfully applied to their replica storage before replication coordinator 1608 will allow tagged replication logs from a next cycle of log switching on primary side 1624 to begin to be applied to replica storage on replica side 1626 by the log file processing agents.
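The gating behavior described above might be sketched as follows, as an illustration only. Logs of one cycle are handed to all agents in parallel, and the next cycle is released only after every agent has confirmed. The `ReplicaAgent` class, the `apply_cycles` function, and the use of a thread pool are assumptions made for the sketch, not details of the embodiments.

```python
from concurrent.futures import ThreadPoolExecutor


class ReplicaAgent:
    """Hypothetical stand-in for a log file processing agent."""

    def __init__(self, name):
        self.name = name
        self.applied = []  # (cycle_id, log_name) in the order applied

    def apply(self, cycle_id, log_names):
        for log_name in log_names:
            self.applied.append((cycle_id, log_name))
        return True  # confirmation returned to the coordinator


def apply_cycles(agents, logs_by_cycle):
    """Apply cycles in order; logs_by_cycle maps cycle_id -> {agent name: logs}.

    Returns the list of cycles that every agent confirmed.
    """
    completed = []
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        for cycle_id in sorted(logs_by_cycle):
            per_agent = logs_by_cycle[cycle_id]
            # All agents work on the same cycle in parallel.
            futures = [
                pool.submit(agent.apply, cycle_id, per_agent.get(agent.name, []))
                for agent in agents
            ]
            confirmations = [future.result() for future in futures]
            if not all(confirmations):
                break  # do not release later cycles
            completed.append(cycle_id)
    return completed
```

In this sketch a missing confirmation simply stops the sequence; how a real coordinator would retry or recover is outside what the sketch attempts to show.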

In this manner, the data in the virtual disks is updated and synchronized with the corresponding virtual disks in primary storage at a point in time (e.g., the time at which log switching is initiated for a cycle). For instance, a tagged version of a replication log 112a switched out from storage 710a may include storage access requests that were applied to one of virtual disk(s) 106a. Log file processing agent 1612a is configured to apply the storage access requests (e.g., data writes) to the corresponding one of virtual disk(s) 1616a in replica storage 1614a. In this manner, the replica virtual disk of virtual disk(s) 1616a is brought forward in time to synchronization with the corresponding primary virtual disk of virtual disk(s) 106a (assuming further writes have not been performed on the primary virtual disk). As such, a multi-VM write-order consistent point is created in replica storage 1614a, 1614b, etc. on replica side 1626. The tagging enables gathering/collation on replica side 1626 of all logs created in the same log switching cycle on primary side 1624. Log file processing agents across the computing devices on replica side 1626 perform similar operations on their replica storage to bring their respective virtual disks in synchronization with the corresponding virtual disks in primary storage.
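As a simplified illustration of applying logged writes to a replica virtual disk, the sketch below models a disk as a byte array and a log as an ordered list of (offset, data) writes. Both representations, and the `apply_log` name, are assumptions for the sketch. It shows that replaying the log in its recorded order reproduces the primary disk, whereas replaying overlapping writes out of order does not.

```python
def apply_log(disk, entries):
    """Replay logged writes onto a disk image, in log order.

    disk: bytearray standing in for a virtual disk (hypothetical model).
    entries: iterable of (offset, data) write requests.
    """
    for offset, data in entries:
        end = offset + len(data)
        if offset < 0 or end > len(disk):
            raise ValueError("write falls outside the disk")
        disk[offset:end] = data


primary = bytearray(8)
replica = bytearray(8)
log = [(0, b"AB"), (1, b"CD")]  # the second write overlaps the first

apply_log(primary, log)
apply_log(replica, log)  # same order: the replica matches the primary

out_of_order = bytearray(8)
apply_log(out_of_order, list(reversed(log)))  # wrong order: result diverges
```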

Accordingly, embodiments provide a consistent point-in-time for an application distributed across multiple hosts to orchestrate replication of the data streams from primary-site hosts to secondary-site (replica) hosts. The orchestrator (e.g., replication coordinator 1606) for each multi-stream application on the primary site coordinates with its counterpart (e.g., replication coordinator 1608) on the secondary site to initiate and drive replication cycles. For achieving near-sync RPO (recovery point objective), the orchestration mechanism imposes relatively little overhead and provides high parallelism of replication channels by offloading the actual data transfer between primary and secondary hosts of each data stream. In one application, the log switching techniques of the prior section may be leveraged to produce a write order consistent point-in-time across the data streams. The storage changes are replicated according to embodiments of the current section to the target replica storage, and create a point-in-time copy for purposes of failover, test failover, etc. For a group of computing devices, it is desirable to identify recovery points in time across the replication streams of all computing devices that belong to the same replication cycle. An orchestrator (replication coordinator) on the secondary site keeps track of changes received at all computing devices of a group on the primary side during a replication cycle, determines if a recovery point can be produced, and keeps track of all such recovery points suitable for failover of a corresponding computing device group on the secondary site. The orchestrator is also resilient to one or more computing devices falling behind, or failing completely, and provides a framework for replication of all computing devices in a group to be synchronized.
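The recovery point bookkeeping described above might be sketched as follows, by way of example only. The sketch assumes that cycles are numbered consecutively from a known first cycle and that a cycle yields a recovery point only once logs from every device of the group have arrived for that cycle and for all earlier cycles. These assumptions, and the `RecoveryPointTracker` name, are not drawn from the embodiments.

```python
class RecoveryPointTracker:
    """Hypothetical secondary-site bookkeeping for one device group."""

    def __init__(self, group, first_cycle=1):
        if not group:
            raise ValueError("group must contain at least one device")
        self.group = frozenset(group)
        self.first_cycle = first_cycle
        self.received = {}  # cycle_id -> set of devices whose logs arrived

    def log_received(self, device, cycle_id):
        """Record that a device's log for the given cycle has arrived."""
        if device not in self.group:
            raise ValueError(f"unknown device: {device}")
        self.received.setdefault(cycle_id, set()).add(device)

    def latest_recovery_point(self):
        """Return the newest cycle for which all cycles so far are complete."""
        latest = None
        cycle = self.first_cycle
        while self.received.get(cycle, set()) == self.group:
            latest = cycle
            cycle += 1
        return latest
```

A device that falls behind holds the recovery point at the last fully received cycle without discarding the newer logs already received from the other devices.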

Note that in embodiments, primary side and replica side replication coordinators may be configured to handle failures at each stage, such as a failure during the generation of write-order consistent logs during log switching, a failure to transmit a subset/all logs to the replica side, a failure to apply logs on the replica side, etc. Furthermore, when a subset of primary side computing devices fail to participate in a log switching/replication cycle, the replication coordinators may be configured to support the transmitting of the logs of/from the other (non-failing) primary side computing devices while the failing subset auto-recovers (or is manually recovered). These features, when present, enable the multi-stream replication scheme to be fault-tolerant, with an ability to recover (which may be automatic) from such failures that are common and/or expected in a distributed system.

V. Example Mobile and Stationary Device Embodiments

Virtual machine 104, SRPM 108, RMM 110, storage write control module 208, VHD interface 304, VHD parser 306, VHD interface 308, IOCTL handler 312, VHD request processing module 324, log request processing module 344, VMMS 340, WOCT coordinator 706, agent 708a, agent 708b, WOCT coordinator 900, log switching initiator 902, log switching manager 904, replication coordinator 1606, replication coordinator 1608, log file tagger 1610, log file processing agent 1612a, process 400, state diagram 500, process 600, flowchart 700, flowchart 800, flowchart 1000, flowchart 1300, step 1402, flowchart 1500, and flowchart 1700 may be implemented in hardware, or hardware combined with software and/or firmware. For example, virtual machine 104, SRPM 108, RMM 110, storage write control module 208, VHD interface 304, VHD parser 306, VHD interface 308, IOCTL handler 312, VHD request processing module 324, log request processing module 344, VMMS 340, WOCT coordinator 706, agent 708a, agent 708b, WOCT coordinator 900, log switching initiator 902, log switching manager 904, replication coordinator 1606, replication coordinator 1608, log file tagger 1610, log file processing agent 1612a, process 400, state diagram 500, process 600, flowchart 700, flowchart 800, flowchart 1000, flowchart 1300, step 1402, flowchart 1500, and/or flowchart 1700 may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer readable storage medium. Alternatively, virtual machine 104, SRPM 108, RMM 110, storage write control module 208, VHD interface 304, VHD parser 306, VHD interface 308, IOCTL handler 312, VHD request processing module 324, log request processing module 344, VMMS 340, WOCT coordinator 706, agent 708a, agent 708b, WOCT coordinator 900, log switching initiator 902, log switching manager 904, replication coordinator 1606, replication coordinator 1608, log file tagger 1610, log file processing agent 1612a, process 400, state diagram 500, process 600, flowchart 700, flowchart 800, flowchart 1000, flowchart 1300, step 1402, flowchart 1500, and/or flowchart 1700 may be implemented as hardware logic/electrical circuitry.

For instance, in an embodiment, one or more, in any combination, of virtual machine 104, SRPM 108, RMM 110, storage write control module 208, VHD interface 304, VHD parser 306, VHD interface 308, IOCTL handler 312, VHD request processing module 324, log request processing module 344, VMMS 340, WOCT coordinator 706, agent 708a, agent 708b, WOCT coordinator 900, log switching initiator 902, log switching manager 904, replication coordinator 1606, replication coordinator 1608, log file tagger 1610, log file processing agent 1612a, process 400, state diagram 500, process 600, flowchart 700, flowchart 800, flowchart 1000, flowchart 1300, step 1402, flowchart 1500, and/or flowchart 1700 may be implemented together in a SoC. The SoC may include an integrated circuit chip that includes one or more of a processor (e.g., a central processing unit (CPU), microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or further circuits, and may optionally execute received program code and/or include embedded firmware to perform functions.

FIG. 18 depicts an exemplary implementation of a computing device 1800 in which embodiments may be implemented. For example, system 100, system 200, computing device 702, computing device 704a, computing device 704b, computing device 1602, and/or computing device 1604a may be implemented in one or more computing devices similar to computing device 1800 in mobile or stationary computer embodiments, including one or more features of computing device 1800 and/or alternative features. The description of computing device 1800 provided herein is provided for purposes of illustration, and is not intended to be limiting. Embodiments may be implemented in further types of computer systems, as would be known to persons skilled in the relevant art(s).

As shown in FIG. 18, computing device 1800 includes one or more processors, referred to as processor circuit 1802, a system memory 1804, and a bus 1806 that couples various system components including system memory 1804 to processor circuit 1802. Processor circuit 1802 is an electrical and/or optical circuit implemented in one or more physical hardware electrical circuit device elements and/or integrated circuit devices (semiconductor material chips or dies) as a central processing unit (CPU), a microcontroller, a microprocessor, and/or other physical hardware processor circuit. Processor circuit 1802 may execute program code stored in a computer readable medium, such as program code of operating system 1830, application programs 1832, other programs 1834, etc. Bus 1806 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. System memory 1804 includes read only memory (ROM) 1808 and random access memory (RAM) 1810. A basic input/output system 1812 (BIOS) is stored in ROM 1808.

Computing device 1800 also has one or more of the following drives: a hard disk drive 1814 for reading from and writing to a hard disk, a magnetic disk drive 1816 for reading from or writing to a removable magnetic disk 1818, and an optical disk drive 1820 for reading from or writing to a removable optical disk 1822 such as a CD ROM, DVD ROM, or other optical media. Hard disk drive 1814, magnetic disk drive 1816, and optical disk drive 1820 are connected to bus 1806 by a hard disk drive interface 1824, a magnetic disk drive interface 1826, and an optical drive interface 1828, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer. Although a hard disk, a removable magnetic disk and a removable optical disk are described, other types of hardware-based computer-readable storage media can be used to store data, such as flash memory cards, digital video disks, RAMs, ROMs, and other hardware storage media.

A number of program modules may be stored on the hard disk, magnetic disk, optical disk, ROM, or RAM. These programs include operating system 1830, one or more application programs 1832, other programs 1834, and program data 1836. Application programs 1832 or other programs 1834 may include, for example, computer program logic (e.g., computer program code or instructions) for implementing virtual machine 104, SRPM 108, RMM 110, storage write control module 208, VHD interface 304, VHD parser 306, VHD interface 308, IOCTL handler 312, VHD request processing module 324, log request processing module 344, VMMS 340, WOCT coordinator 706, agent 708a, agent 708b, WOCT coordinator 900, log switching initiator 902, log switching manager 904, replication coordinator 1606, replication coordinator 1608, log file tagger 1610, log file processing agent 1612a, process 400, state diagram 500, process 600, flowchart 700, flowchart 800, flowchart 1000, flowchart 1300, step 1402, flowchart 1500, and/or flowchart 1700 (including any suitable step of processes 400, 600, state machine 500, flowcharts 700, 800, 1000, 1300, 1500, 1700), and/or further embodiments described herein.

A user may enter commands and information into the computing device 1800 through input devices such as keyboard 1838 and pointing device 1840. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, a touch screen and/or touch pad, a voice recognition system to receive voice input, a gesture recognition system to receive gesture input, or the like. These and other input devices are often connected to processor circuit 1802 through a serial port interface 1842 that is coupled to bus 1806, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB).

A display screen 1844 is also connected to bus 1806 via an interface, such as a video adapter 1846. Display screen 1844 may be external to, or incorporated in computing device 1800. Display screen 1844 may display information, as well as being a user interface for receiving user commands and/or other information (e.g., by touch, finger gestures, virtual keyboard, etc.). In addition to display screen 1844, computing device 1800 may include other peripheral output devices (not shown) such as speakers and printers.

Computing device 1800 is connected to a network 1848 (e.g., the Internet) through an adaptor or network interface 1850, a modem 1852, or other means for establishing communications over the network. Modem 1852, which may be internal or external, may be connected to bus 1806 via serial port interface 1842, as shown in FIG. 18, or may be connected to bus 1806 using another interface type, including a parallel interface.

As used herein, the terms “computer program medium,” “computer-readable medium,” and “computer-readable storage medium” are used to generally refer to physical hardware media such as the hard disk associated with hard disk drive 1814, removable magnetic disk 1818, removable optical disk 1822, other physical hardware media such as RAMs, ROMs, flash memory cards, digital video disks, zip disks, MEMs, nanotechnology-based storage devices, and further types of physical/tangible hardware storage media. Such computer-readable storage media are distinguished from and non-overlapping with communication media (do not include communication media). Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wireless media such as acoustic, RF, infrared and other wireless media, as well as wired media. Embodiments are also directed to such communication media.

As noted above, computer programs and modules (including application programs 1832 and other programs 1834) may be stored on the hard disk, magnetic disk, optical disk, ROM, RAM, or other hardware storage medium. Such computer programs may also be received via network interface 1850, serial port interface 1842, or any other interface type. Such computer programs, when executed or loaded by an application, enable computing device 1800 to implement features of embodiments discussed herein. Accordingly, such computer programs represent controllers of the computing device 1800.

Embodiments are also directed to computer program products comprising computer code or instructions stored on any computer-readable medium. Such computer program products include hard disk drives, optical disk drives, memory device packages, portable memory sticks, memory cards, and other types of physical storage hardware.

VI. Example Embodiments

In one embodiment, a method in a write order consistent tracking (WOCT) coordinator is provided, comprising: initiating a cycle of a log switching of a plurality of logs associated with a plurality of virtual disks at a plurality of computing devices, the virtual disks storing data that is write order dependent amongst the virtual disks, each computing device of the plurality of computing devices including at least one of a virtual disk of the plurality of virtual disks that receives storage access requests from an application, the storage access requests including write requests, and a log of the plurality of logs corresponding to the virtual disk that receives log queue entries corresponding to the storage access requests; and coordinating the cycle of the log switching of the plurality of logs at the plurality of computing devices across the virtual disks to maintain request ordering for write order dependent requests.

In an embodiment, the coordinating comprises: enacting a plurality of stages to cause the switching of the plurality of logs at the plurality of computing devices.

In an embodiment, the initiating comprises: taking an exclusive lock on each of a plurality of begin stage lock files, each begin stage lock file associated with a corresponding stage of the plurality of stages; and transmitting a log switching initiation instruction to a plurality of agents at the computing devices, each computing device including a corresponding agent of the plurality of agents.

In an embodiment, the initiating further comprises: receiving a response from each of the agents, each response received from an agent of the plurality of agents indicating that the agent took a shared lock on each of a plurality of end stage lock files, each end stage lock file associated with a corresponding stage of the plurality of stages.

In an embodiment, each stage of the plurality of stages is enacted by performing: releasing the exclusive lock on the associated begin stage lock file to signal the beginning of the stage to the agents, attempting to take an exclusive lock on the associated end stage lock file, taking the exclusive lock on the associated end stage lock file when enabled by the agents having released all shared locks on the associated end stage lock file to signify completion of the stage by the agents, and transitioning to enacting a next stage until a final stage of the plurality of stages is completed.
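The lock-based staging described in the preceding embodiments might be sketched as follows, purely as an illustration. The sketch replaces lock files with an in-memory shared/exclusive lock so that it runs within a single process, and it assumes that each agent waits for a stage to begin by attempting to take a shared lock on the begin stage lock. The class and function names are hypothetical, and a barrier stands in for the agents' responses to the initiation instruction.

```python
import threading

NUM_STAGES = 4


class SharedExclusiveLock:
    """In-memory stand-in for a lock file with shared and exclusive modes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0
        self._exclusive = False

    def acquire_exclusive(self):
        with self._cond:
            while self._exclusive or self._shared:
                self._cond.wait()
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()


def run_cycle(agent_names):
    """Run one log switching cycle; return the recorded (stage, who) events."""
    begin = [SharedExclusiveLock() for _ in range(NUM_STAGES)]
    end = [SharedExclusiveLock() for _ in range(NUM_STAGES)]
    events, events_lock = [], threading.Lock()

    def record(item):
        with events_lock:
            events.append(item)

    # Initiation: the coordinator takes every begin stage lock exclusively.
    for lock in begin:
        lock.acquire_exclusive()
    ready = threading.Barrier(len(agent_names) + 1)

    def agent(name):
        for lock in end:
            lock.acquire_shared()          # shared lock on each end stage lock
        ready.wait()                       # stands in for the agent's response
        for stage in range(NUM_STAGES):
            begin[stage].acquire_shared()  # blocks until the stage is begun
            record((stage, name))          # the stage's work would happen here
            begin[stage].release_shared()
            end[stage].release_shared()    # signifies completion of the stage

    threads = [threading.Thread(target=agent, args=(n,)) for n in agent_names]
    for thread in threads:
        thread.start()
    ready.wait()                           # all agents have responded
    for stage in range(NUM_STAGES):
        begin[stage].release_exclusive()   # signal the beginning of the stage
        end[stage].acquire_exclusive()     # granted once all agents released
        record((stage, "coordinator-done"))
    for thread in threads:
        thread.join()
    return events
```

Because the coordinator can take the exclusive lock on an end stage lock only after every agent has released its shared lock, no agent begins stage N+1 before all agents have finished stage N.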

In an embodiment, the enacting a plurality of stages comprises: enacting a first stage during which a new log is initialized at each computing device of the plurality of computing devices; enacting a second stage during which received log queue entries are blocked from being received by the logs at the plurality of computing devices; enacting a third stage during which the new log is configured to be used to receive the log queue entries at each computing device of the plurality of computing devices, and received log queue entries are unblocked from being received by the logs at the plurality of computing devices; and enacting a fourth stage during which the log switching is finalized.
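The four stages above might be sketched from the point of view of a single agent as follows. This is a simplified model under stated assumptions: logs are plain lists, entries arriving while blocked are held and then delivered to the new log, and finalizing means handing the old log over for replication. The `LogSwitchAgent` class and its method names are hypothetical.

```python
class LogSwitchAgent:
    """Hypothetical per-device agent that switches logs in four stages."""

    def __init__(self):
        self.current = []       # log currently receiving entries
        self.new = None         # new log prepared in stage 1
        self.retiring = None    # old log between stages 3 and 4
        self.blocked = False
        self.held = []          # entries that arrived while blocked
        self.switched_out = []  # finalized logs, ready for replication

    def write(self, entry):
        """Receive a log queue entry for a storage access request."""
        (self.held if self.blocked else self.current).append(entry)

    def stage1_init_new_log(self):
        self.new = []

    def stage2_block(self):
        self.blocked = True

    def stage3_switch_and_unblock(self):
        self.retiring, self.current, self.new = self.current, self.new, None
        self.blocked = False
        self.current.extend(self.held)  # held entries land in the new log
        self.held.clear()

    def stage4_finalize(self):
        self.switched_out.append(self.retiring)
        self.retiring = None


# Two devices stepped through the stages together by a coordinator.
a, b = LogSwitchAgent(), LogSwitchAgent()
a.write("w1")
for agent in (a, b):
    agent.stage1_init_new_log()
for agent in (a, b):
    agent.stage2_block()
b.write("w2")  # arrives after blocking began, so it is held
for agent in (a, b):
    agent.stage3_switch_and_unblock()
for agent in (a, b):
    agent.stage4_finalize()
```

In this run the switched-out logs contain only `w1`, while `w2`, which arrived after every device was blocked, falls into the next cycle's log on its device. Stepping all devices through each stage together is what keeps the cut consistent across devices.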

In an embodiment, the coordinating comprises: transmitting control codes and/or messages to a plurality of agents at the computing devices to enact the plurality of stages, each computing device including a corresponding agent of the plurality of agents.

In an embodiment, each stage of the plurality of stages is enacted by performing: transmitting a control code to the plurality of agents; awaiting a response to the transmitted control code from each of the plurality of agents; aborting the log switching if at least one of the agents does not respond with the awaited response within a predetermined time period for the plurality of stages to be completed; and transitioning to enacting a next stage if all agents respond within the predetermined time period, said transitioning including completing the log switching when a final stage of the plurality of stages is completed.
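The control code variant might be sketched as follows, as an illustration under stated assumptions. Agents are modeled as callables that return `True` as their response, the control code names are invented for the sketch, and a single deadline covers all of the stages, which is one reading of the predetermined time period described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical control codes, one per stage.
STAGES = ["INIT_NEW_LOG", "BLOCK", "SWITCH_UNBLOCK", "FINALIZE"]


def _responded(future):
    """True only if the agent returned the awaited response."""
    return future.exception() is None and future.result() is True


def run_log_switch(agents, timeout_s):
    """Drive all stages; return ("completed", code) or ("aborted", code)."""
    deadline = time.monotonic() + timeout_s  # one budget for all stages
    pool = ThreadPoolExecutor(max_workers=max(1, len(agents)))
    try:
        for code in STAGES:
            # Transmit the control code to every agent.
            futures = [pool.submit(agent, code) for agent in agents]
            remaining = max(0.0, deadline - time.monotonic())
            done, not_done = wait(futures, timeout=remaining)
            # Abort if any agent is late or gives an unexpected response.
            if not_done or not all(_responded(f) for f in done):
                return ("aborted", code)
        return ("completed", STAGES[-1])
    finally:
        pool.shutdown(wait=False)
```

A caller could pass, for example, one callable per agent that forwards the code over the network and returns `True` on acknowledgment. What an aborted cycle does to logs already initialized is not addressed by the sketch.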

In another embodiment, a write order consistent tracking (WOCT) coordinator comprises: a log switching initiator configured to communicate with a plurality of agents at a plurality of computing devices to initiate cycles of a log switching of a plurality of logs associated with a plurality of virtual disks at the plurality of computing devices, a cycle of the log switching including a switching out of each current log for a corresponding new log, each computing device of the plurality of computing devices including at least one of a virtual disk of the plurality of virtual disks that receives storage access requests from an application, the storage access requests including write requests, and a log of the plurality of logs corresponding to the virtual disk that receives log queue entries corresponding to the storage access requests; and a log switching manager configured to coordinate the cycles of the log switching of the plurality of logs at the plurality of computing devices to maintain request ordering for write order dependent requests across virtual disks.

In an embodiment, the log switching manager is configured to enact a plurality of stages to cause a cycle of the switching of the plurality of logs at the plurality of computing devices.

In an embodiment, for a cycle of the log switching, the log switching initiator is configured to: take an exclusive lock on each of a plurality of begin stage lock files, each begin stage lock file associated with a corresponding stage of the plurality of stages; and transmit a log switching initiation instruction to the plurality of agents at the computing devices to initiate the log switching.

In an embodiment, the log switching initiator is configured to receive a response from each of the agents, each response received from an agent of the plurality of agents indicating that the agent took a shared lock on each of a plurality of end stage lock files, each end stage lock file associated with a corresponding stage of the plurality of stages.

In an embodiment, to enact each stage of the plurality of stages, the log switching manager is configured to: release the exclusive lock taken by the log switching initiator on the associated begin stage lock file to signal the beginning of the stage to the agents, attempt to take an exclusive lock on the associated end stage lock file, take the exclusive lock on the associated end stage lock file when enabled by the agents having released all shared locks on the associated end stage lock file to signify completion of the stage by the agents, and transition to enacting a next stage until a final stage of the plurality of stages is completed.

In an embodiment, to enact each stage of the plurality of stages, the log switching manager is configured to: transmit a control code to the plurality of agents; await a response to the transmitted control code from each of the plurality of agents; abort the log switching if at least one of the agents does not respond with the awaited response within a predetermined time period for the plurality of stages to be completed; and transition to enacting a next stage if all agents respond within the predetermined time period, the log switching being completed when a final stage of the plurality of stages is completed.

In another embodiment, a method in a replication coordinator is provided, comprising: transmitting an instruction to perform a cycle of log switching of a plurality of logs associated with a first plurality of virtual disks at a plurality of computing devices on a primary side, the first plurality of virtual disks storing data of a distributed application, each log of the plurality of logs associated with a virtual disk of the first plurality of virtual disks, each virtual disk of the first plurality of virtual disks configured to receive storage access requests from the distributed application, and the corresponding log configured to receive log queue entries corresponding to the storage access requests; receiving a plurality of logs from the computing devices in response to performance of the cycle of log switching; tagging each log of the received plurality of logs to at least indicate the cycle of log switching; and providing the tagged plurality of logs to enable a write-order consistent storage point in a second plurality of virtual disks on a replica side, the write-order consistent storage point being a replica of the first plurality of virtual disks on the primary side at a point in time, the storage access requests applicable to synchronize the second plurality of virtual disks with the first plurality of virtual disks.

In an embodiment, the transmitting comprises: instructing a write order consistent tracking (WOCT) coordinator to coordinate the cycle of log switching.

In an embodiment, the providing comprises: transmitting the tagged plurality of logs to a second replication coordinator configured to coordinate application of the storage access requests to the second plurality of virtual disks, the first and second replication coordinators each configured to handle failures, including at least one of handling a failure during generation of the plurality of logs, a failure to receive a subset of the plurality of logs at the replica side, or a failure to apply all of the plurality of logs to the second plurality of virtual disks on the replica side.

In an embodiment, when a subset of the plurality of computing devices on the primary side fails to participate in the cycle of log switching, the first and second replication coordinators support transmitting the plurality of logs of others of the plurality of computing devices to the replica side while the subset recovers.

In an embodiment, the providing comprises: providing the tagged plurality of logs to a plurality of agents at a second plurality of computing devices to apply the storage access requests to the second plurality of virtual disks; and the method further comprises: awaiting a confirmation from the plurality of agents that the tagged plurality of logs were successfully applied to the second plurality of virtual disks; and enabling a second set of tagged logs to be applied to the second plurality of virtual disks in response to receiving the confirmation from the plurality of agents.

In an embodiment, the receiving a plurality of logs from the computing devices in response to performance of the cycle of log switching comprises: receiving each log of the plurality of logs individually from the corresponding computing device of the plurality of computing devices on the primary side.

VII. Conclusion

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those skilled in the relevant art(s) that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined in the appended claims. Accordingly, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
 1. A method in a write order consistent tracking (WOCT) coordinator, comprising: initiating a cycle of a log switching of a plurality of logs associated with a plurality of virtual disks at a plurality of computing devices, the initiating including taking an exclusive lock on each of a plurality of begin stage lock files, the virtual disks storing data that is write order dependent amongst the virtual disks, each computing device of the plurality of computing devices including at least one of a virtual disk of the plurality of virtual disks that receives storage access requests from an application, the storage access requests including write requests, and a log of the plurality of logs corresponding to the virtual disk that receives log queue entries corresponding to the storage access requests; and coordinating the cycle of the log switching of the plurality of logs at the plurality of computing devices across the virtual disks to maintain request ordering for write order dependent requests.
 2. The method of claim 1, wherein said coordinating comprises: enacting a plurality of stages to cause the switching of the plurality of logs at the plurality of computing devices.
 3. The method of claim 2, wherein each begin stage lock file is associated with a corresponding stage of the plurality of stages; and wherein said initiating comprises: transmitting a log switching initiation instruction to a plurality of agents at the computing devices, each computing device including a corresponding agent of the plurality of agents.
 4. The method of claim 3, wherein said initiating further comprises: receiving a response from each of the agents, each response received from an agent of the plurality of agents indicating that the agent took a shared lock on each of a plurality of end stage lock files, each end stage lock file associated with a corresponding stage of the plurality of stages.
 5. The method of claim 4, wherein each stage of the plurality of stages is enacted by performing: releasing the exclusive lock on the associated begin stage lock file to signal the beginning of the stage to the agents, attempting to take an exclusive lock on the associated end stage lock file, taking the exclusive lock on the associated end stage lock file when enabled by the agents having released all shared locks on the associated end stage lock file to signify completion of the stage by the agents, and transitioning to enacting a next stage until a final stage of the plurality of stages is completed.
 6. The method of claim 2, wherein said coordinating comprises: transmitting control codes and/or messages to a plurality of agents at the computing devices to enact the plurality of stages, each computing device including a corresponding agent of the plurality of agents.
 7. The method of claim 6, wherein each stage of the plurality of stages is enacted by performing: transmitting a control code to the plurality of agents; awaiting a response to the transmitted control code from each of the plurality of agents; aborting the log switching if at least one of the agents does not respond with the awaited response within a predetermined time period for the plurality of stages to be completed; and transitioning to enacting a next stage if all agents respond within the predetermined time period, said transitioning including completing the log switching when a final stage of the plurality of stages is completed.
 8. A write order consistent tracking (WOCT) coordinator, comprising: at least one processor circuit; and memory that stores computer executable instructions for operations performed by the at least one processor circuit, the computer executable instructions forming program code including: a log switching initiator configured to communicate with a plurality of agents at a plurality of computing devices to initiate cycles of a log switching of a plurality of logs associated with a plurality of virtual disks at the plurality of computing devices and take an exclusive lock on each of a plurality of begin stage lock files, a cycle of the log switching including a switching out of each current log for a corresponding new log, each computing device of the plurality of computing devices including at least one of a virtual disk of the plurality of virtual disks that receives storage access requests from an application, the storage access requests including write requests, and a log of the plurality of logs corresponding to the virtual disk that receives log queue entries corresponding to the storage access requests; and a log switching manager configured to coordinate the cycles of the log switching of the plurality of logs at the plurality of computing devices to maintain request ordering for write order dependent requests across virtual disks.
 9. The WOCT coordinator of claim 8, wherein the log switching manager is configured to enact a plurality of stages to cause a cycle of the switching of the plurality of logs at the plurality of computing devices.
 10. The WOCT coordinator of claim 9, wherein each begin stage lock file is associated with a corresponding stage of the plurality of stages; and wherein, for a cycle of the log switching, the log switching initiator is configured to: transmit a log switching initiation instruction to the plurality of agents at the computing devices to initiate the log switching.
 11. The WOCT coordinator of claim 10, wherein the log switching initiator is configured to receive a response from each of the agents, each response received from an agent of the plurality of agents indicating that the agent took a shared lock on each of a plurality of end stage lock files, each end stage lock file associated with a corresponding stage of the plurality of stages.
 12. The WOCT coordinator of claim 11, wherein to enact each stage of the plurality of stages, the log switching manager is configured to: release the exclusive lock taken by the log switching initiator on the associated begin stage lock file to signal the beginning of the stage to the agents, attempt to take an exclusive lock on the associated end stage lock file, take the exclusive lock on the associated end stage lock file when enabled by the agents having released all shared locks on the associated end stage lock file to signify completion of the stage by the agents, and transition to enacting a next stage until a final stage of the plurality of stages is completed.
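As an illustration only, the begin stage and end stage lock handshake can be modeled in memory as shown below. This sketch makes several assumptions that are not in the claims: `StageLock` is a hypothetical in-process stand-in for a lock file, agents are threads, an agent learns that a stage has begun by obtaining a shared lock on the begin stage lock, and the agents' responses are modeled with a semaphore.

```python
import threading


class StageLock:
    """Hypothetical in-memory stand-in for a lock file (shared/exclusive)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0
        self._exclusive = False

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._exclusive or self._shared:
                self._cond.wait()
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()


def agent(name, begin, end, ready, work_log):
    for lock in end:
        lock.acquire_shared()          # shared lock on every end stage lock
    ready.release()                    # response to the coordinator
    for stage in range(len(begin)):
        begin[stage].acquire_shared()  # blocks until the stage is signaled
        work_log.append((stage, name)) # placeholder for the stage's real work
        begin[stage].release_shared()
        end[stage].release_shared()    # signifies completion of the stage


def run_cycle(n_agents, n_stages):
    begin = [StageLock() for _ in range(n_stages)]
    end = [StageLock() for _ in range(n_stages)]
    for lock in begin:
        lock.acquire_exclusive()       # initiator holds every begin stage lock
    ready, work_log, completed = threading.Semaphore(0), [], []
    threads = [threading.Thread(target=agent,
                                args=(f"agent{i}", begin, end, ready, work_log))
               for i in range(n_agents)]
    for t in threads:
        t.start()
    for _ in threads:
        ready.acquire()                # wait for every agent's response
    for stage in range(n_stages):
        begin[stage].release_exclusive()  # signal the beginning of the stage
        end[stage].acquire_exclusive()    # succeeds once all agents finish
        completed.append(stage)
    for t in threads:
        t.join()
    return completed, work_log
```

Because the coordinator cannot take the exclusive lock on an end stage lock until every agent has released its shared lock, no agent can start stage N+1 before all agents have finished stage N.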
 13. A method in a write order consistent tracking (WOCT) coordinator, comprising: initiating a cycle of a log switching of a plurality of logs associated with a plurality of virtual disks at a plurality of computing devices, the virtual disks storing data that is write order dependent amongst the virtual disks, each computing device of the plurality of computing devices including at least one of a virtual disk of the plurality of virtual disks that receives storage access requests from an application, the storage access requests including write requests, and a log of the plurality of logs corresponding to the virtual disk that receives log queue entries corresponding to the storage access requests; and coordinating the cycle of the log switching of the plurality of logs at the plurality of computing devices across the virtual disks to maintain request ordering for write order dependent requests, said coordinating including enacting a plurality of stages to cause the switching of the plurality of logs at the plurality of computing devices, said enacting a plurality of stages comprising: enacting a first stage during which a new log is initialized at each computing device of the plurality of computing devices; enacting a second stage during which received log queue entries are blocked from being received by the logs at the plurality of computing devices; enacting a third stage during which the new log is configured to be used to receive the log queue entries at each computing device of the plurality of computing devices, and received log queue entries are unblocked from being received by the logs at the plurality of computing devices; and enacting a fourth stage during which the log switching is finalized.
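As an illustration only, the four stages can be sketched against a toy per-disk log. The `DiskLog` class, its method names, and the `run_stage` helper are hypothetical, and holding blocked entries in a queue until the third stage is an assumption of this sketch rather than something the claim requires.

```python
from collections import deque


class DiskLog:
    """Toy model of one virtual disk's log on one computing device."""

    def __init__(self):
        self.current = []        # log currently receiving log queue entries
        self.new_log = None      # initialized in stage 1
        self.old_log = None      # the log being switched out
        self.blocked = False
        self.held = deque()      # assumption: blocked entries wait here
        self.switched_out = []   # finalized logs, ready for replication

    def write(self, entry):
        (self.held if self.blocked else self.current).append(entry)

    def stage1(self):            # initialize a new log
        self.new_log = []

    def stage2(self):            # block entries from reaching the logs
        self.blocked = True

    def stage3(self):            # use the new log, then unblock
        self.old_log, self.current, self.new_log = self.current, self.new_log, None
        self.blocked = False
        while self.held:
            self.current.append(self.held.popleft())

    def stage4(self):            # finalize the log switching
        self.switched_out.append(self.old_log)
        self.old_log = None


def run_stage(logs, stage_name):
    """Enact one stage on every log before any log moves to the next stage."""
    for log in logs:
        getattr(log, stage_name)()
```

Since every log is blocked in the second stage before any log is switched in the third, a write that depends on an earlier write to another disk cannot land in an older log than the write it depends on.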
 14. A write order consistent tracking (WOCT) coordinator, comprising: at least one processor circuit; and memory that stores computer executable instructions for operations performed by the at least one processor circuit, the computer executable instructions forming program code including: a log switching initiator configured to communicate with a plurality of agents at a plurality of computing devices to initiate cycles of a log switching of a plurality of logs associated with a plurality of virtual disks at the plurality of computing devices, a cycle of the log switching including a switching out of each current log for a corresponding new log, each computing device of the plurality of computing devices including at least one of a virtual disk of the plurality of virtual disks that receives storage access requests from an application, the storage access requests including write requests, and a log of the plurality of logs corresponding to the virtual disk that receives log queue entries corresponding to the storage access requests; and a log switching manager configured to: coordinate the cycles of the log switching of the plurality of logs at the plurality of computing devices to maintain request ordering for write order dependent requests across virtual disks; and enact a plurality of stages to cause a cycle of the switching of the plurality of logs at the plurality of computing devices; wherein to enact each stage of the plurality of stages, the log switching manager is configured to: transmit a control code to the plurality of agents; await a response to the transmitted control code from each of the plurality of agents; abort the log switching if at least one of the agents does not respond with the awaited response within a predetermined time period for the plurality of stages to be completed; and transition to enacting a next stage if all agents respond within the predetermined time period, the log switching being completed when a final stage of the plurality of stages is completed.
 15. A method in a first replication coordinator, comprising: transmitting an instruction to perform a cycle of log switching of a plurality of logs associated with a first plurality of virtual disks at a plurality of computing devices on a primary side, the first plurality of virtual disks storing data of a distributed application, each log of the plurality of logs associated with a virtual disk of the first plurality of virtual disks, each virtual disk of the first plurality of virtual disks configured to receive storage access requests from the distributed application, and the corresponding log configured to receive log queue entries corresponding to the storage access requests; receiving a plurality of logs from the computing devices in response to performance of the cycle of log switching; tagging each log of the received plurality of logs to at least indicate the cycle of log switching; and providing the tagged plurality of logs to enable a write-order consistent storage point in a second plurality of virtual disks on a replica side by transmitting the tagged plurality of logs to a second replication coordinator, the write-order consistent storage point being a replica of the first plurality of virtual disks on the primary side at a point in time, the storage access requests applicable to synchronize the second plurality of virtual disks with the first plurality of virtual disks.
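As an illustration only, tagging each received log with its log switching cycle might look like the sketch below. The `TaggedLog` fields, the `tag_logs` and `is_complete` helpers, and the use of an integer cycle number are all hypothetical; the claim requires only that the tag at least indicate the cycle.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TaggedLog:
    cycle: int                  # which log switching cycle produced this log
    host: str                   # computing device on the primary side
    disk: str                   # virtual disk the log belongs to
    entries: Tuple[str, ...]    # log queue entries (storage access requests)


def tag_logs(cycle: int,
             received: Dict[Tuple[str, str], List[str]]) -> List[TaggedLog]:
    """Tag every log received for one cycle; keys are (host, disk) pairs."""
    return [TaggedLog(cycle, host, disk, tuple(entries))
            for (host, disk), entries in sorted(received.items())]


def is_complete(tagged: Iterable[TaggedLog], cycle: int,
                expected: Iterable[Tuple[str, str]]) -> bool:
    """True when a log for every expected (host, disk) carries this cycle tag."""
    have = {(t.host, t.disk) for t in tagged if t.cycle == cycle}
    return have == set(expected)
```

With a tag of this kind, the replica side can tell when it holds the full set of logs for a cycle and can therefore apply them as one write-order consistent storage point.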
 16. The method of claim 15, wherein said transmitting comprises: instructing a write order consistent tracking (WOCT) coordinator to coordinate the cycle of log switching.
 17. The method of claim 15, wherein the second replication coordinator is configured to coordinate application of the storage access requests to the second plurality of virtual disks, the first and second replication coordinators each configured to handle failures, including at least one of handling a failure during generation of the plurality of logs, a failure to receive a subset of the plurality of logs at the replica side, or a failure to apply all of the plurality of logs to the second plurality of virtual disks on the replica side.
 18. The method of claim 17, wherein when a subset of the plurality of computing devices on the primary side fails to participate in the cycle of log switching, the first and second replication coordinators support transmitting the plurality of logs of others of the plurality of computing devices to the replica side while the subset recovers.
 19. The method of claim 15, wherein said providing comprises: providing the tagged plurality of logs to a plurality of agents at a second plurality of computing devices to apply the storage access requests to the second plurality of virtual disks; the method further comprising: awaiting a confirmation from the plurality of agents that the tagged plurality of logs were successfully applied to the second plurality of virtual disks; and enabling a second set of tagged logs to be applied to the second plurality of virtual disks in response to receiving the confirmation from the plurality of agents.
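As an illustration only, the confirmation gating recited above can be sketched as a small state machine. The `ReplicaGate` class and its methods are hypothetical, and each set of tagged logs is represented here just by its cycle number.

```python
from collections import deque


class ReplicaGate:
    """Releases one set of tagged logs at a time to the replica-side agents."""

    def __init__(self, agents):
        self.agents = set(agents)
        self.queue = deque()     # cycles waiting to be applied
        self.in_flight = None    # cycle currently being applied
        self.confirmed = set()   # agents that confirmed the in-flight cycle

    def submit(self, cycle):
        """Queue a set of tagged logs; return the cycle released, if any."""
        self.queue.append(cycle)
        return self._release_next()

    def confirm(self, agent, cycle):
        """Record an agent's confirmation; return the next cycle released, if any."""
        if cycle != self.in_flight or agent not in self.agents:
            return None
        self.confirmed.add(agent)
        if self.confirmed == self.agents:   # every agent applied its logs
            self.in_flight, self.confirmed = None, set()
            return self._release_next()
        return None

    def _release_next(self):
        if self.in_flight is None and self.queue:
            self.in_flight = self.queue.popleft()
            return self.in_flight
        return None
```

Holding back the second set until every agent confirms the first keeps the replica-side virtual disks from mixing writes that belong to different cycles.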
 20. The method of claim 15, wherein said receiving comprises: receiving each log of the plurality of logs individually from the corresponding computing device on the primary side.
 21. A method in a first replication coordinator, comprising: transmitting an instruction to perform a cycle of log switching of logs associated with a first plurality of virtual disks at a plurality of computing devices on a primary side, each virtual disk of the first plurality of virtual disks configured to receive storage access requests, and a log corresponding to the virtual disk configured to receive log queue entries corresponding to the storage access requests; receiving logs from the computing devices in response to performance of the cycle of log switching; tagging each log of the received logs to at least indicate the cycle of log switching; and transmitting the tagged logs to a second replication coordinator to enable a write-order consistent storage point in a second plurality of virtual disks on a replica side.
 22. The method of claim 21, wherein said transmitting an instruction comprises: instructing a write order consistent tracking (WOCT) coordinator to coordinate the cycle of log switching.
 23. The method of claim 21, wherein said transmitting the tagged logs comprises: transmitting the tagged logs to a second replication coordinator configured to coordinate application of the storage access requests to the second plurality of virtual disks, the first and second replication coordinators each configured to handle failures, including at least one of handling a failure during generation of the logs, a failure to receive a subset of the logs at the replica side, or a failure to apply all of the logs to the second plurality of virtual disks on the replica side.
 24. The method of claim 23, wherein when a subset of the plurality of computing devices on the primary side fails to participate in the cycle of log switching, the first and second replication coordinators support transmitting the logs of others of the plurality of computing devices to the replica side while the subset recovers.
 25. The method of claim 21, wherein said transmitting the tagged logs comprises: providing the tagged logs to a plurality of agents at a second plurality of computing devices to apply the storage access requests to the second plurality of virtual disks; the method further comprising: awaiting a confirmation from the plurality of agents that the tagged logs were successfully applied to the second plurality of virtual disks; and enabling a second set of tagged logs to be applied to the second plurality of virtual disks in response to receiving the confirmation from the plurality of agents.
 26. The method of claim 21, wherein said receiving comprises: receiving each log of the logs individually from the corresponding computing device on the primary side.
 27. The method of claim 21, wherein the write-order consistent storage point is a replica of the first plurality of virtual disks on the primary side at a point in time, the storage access requests applicable to synchronize the second plurality of virtual disks with the first plurality of virtual disks.
 28. A system, comprising: at least one processor circuit; and memory that stores computer executable instructions for operations performed by the at least one processor circuit, the computer executable instructions forming program code including: a first replication coordinator configured to transmit an instruction to perform a cycle of log switching of logs associated with a first plurality of virtual disks at a plurality of computing devices on a primary side, each virtual disk of the first plurality of virtual disks configured to receive storage access requests, and a log corresponding to the virtual disk configured to receive log queue entries corresponding to the storage access requests, and receive logs from the computing devices in response to performance of the cycle of log switching; and a log file tagger configured to tag each log of the received logs to at least indicate the cycle of log switching; and the first replication coordinator configured to transmit the tagged logs to a second replication coordinator to enable a write-order consistent storage point in a second plurality of virtual disks on a replica side.
 29. The system of claim 28, wherein the first replication coordinator is configured to instruct a write order consistent tracking (WOCT) coordinator to coordinate the cycle of log switching.
 30. The system of claim 28, wherein the first replication coordinator is configured to transmit the tagged logs to a second replication coordinator configured to coordinate application of the storage access requests to the second plurality of virtual disks, the first and second replication coordinators each configured to handle failures, including at least one of handling a failure during generation of the logs, a failure to receive a subset of the logs at the replica side, or a failure to apply all of the logs to the second plurality of virtual disks on the replica side.
 31. The system of claim 30, wherein when a subset of the plurality of computing devices on the primary side fails to participate in the cycle of log switching, the first and second replication coordinators support transmitting the logs of others of the plurality of computing devices to the replica side while the subset recovers.
 32. The system of claim 28, wherein the first replication coordinator is configured to provide the tagged logs to a plurality of agents at a second plurality of computing devices to apply the storage access requests to the second plurality of virtual disks; the first replication coordinator is further configured to: await a confirmation from the plurality of agents that the tagged logs were successfully applied to the second plurality of virtual disks; and enable a second set of tagged logs to be applied to the second plurality of virtual disks in response to receiving the confirmation from the plurality of agents.
 33. The system of claim 28, wherein the first replication coordinator is configured to receive each log of the logs individually from the corresponding computing device on the primary side.
 34. The system of claim 28, wherein the write-order consistent storage point is a replica of the first plurality of virtual disks on the primary side at a point in time, the storage access requests applicable to synchronize the second plurality of virtual disks with the first plurality of virtual disks.
 35. A write order consistent tracking (WOCT) coordinator, comprising: at least one processor circuit; and memory that stores computer executable instructions for operations performed by the at least one processor circuit, the computer executable instructions forming program code including: a log switching initiator configured to communicate with a plurality of agents at a plurality of computing devices to initiate cycles of a log switching of logs associated with a plurality of virtual disks at the plurality of computing devices and take an exclusive lock on each of a plurality of begin stage lock files; and a log switching manager configured to coordinate the cycles of the log switching of the logs at the plurality of computing devices to maintain request ordering for write order dependent requests across virtual disks; wherein during a stage of a cycle of log switching, the log switching manager is configured to: abort the log switching if at least one of the agents does not respond with an awaited response within a predetermined time period for the plurality of stages to be completed; and transition to enacting a next stage of the cycle of log switching if all agents respond within the predetermined time period.
 36. The WOCT coordinator of claim 35, wherein a cycle of the log switching includes a switching out of each current log for a corresponding new log, and each computing device of the plurality of computing devices includes at least one of a virtual disk that receives storage access requests from an application, the storage access requests including write requests, and a log corresponding to the virtual disk that receives log queue entries corresponding to the storage access requests.
 37. The WOCT coordinator of claim 35, wherein, for the cycle of the log switching, the log switching initiator is configured to: transmit a log switching initiation instruction to the plurality of agents at the computing devices to initiate the log switching.
 38. The WOCT coordinator of claim 37, wherein the log switching initiator is configured to receive a response from each of the agents, each response received from an agent of the plurality of agents indicating that the agent took a shared lock on each of a plurality of end stage lock files.
 39. The WOCT coordinator of claim 38, wherein during the stage, the log switching manager is configured to: release the exclusive lock taken by the log switching initiator on the associated begin stage lock file to signal the beginning of the stage to the agents, attempt to take an exclusive lock on the associated end stage lock file, take the exclusive lock on the associated end stage lock file when enabled by the agents having released all shared locks on the associated end stage lock file to signify completion of the stage by the agents, and transition to enacting a next stage until a final stage of the plurality of stages is completed.
 40. The WOCT coordinator of claim 35, wherein the log switching manager is configured to: enact a plurality of stages to cause a cycle of the switching of the logs at the plurality of computing devices, the log switching being completed when a final stage of the plurality of stages is completed; and wherein the awaited response is a response to a control code transmitted to the agents.